Background of the Invention

The invention pertains to wireless communications and, more particularly, to 
communications computers. The invention has application, by way of non-limiting example, in 
improving the capacity of cellular phone base stations. 

Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal time- 
slots, respectively. 

A limiting factor in CDMA communication and, particularly, in so-called direct sequence 
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g., 
multiple cellular phone users in the same geographic area using their phones at the same time. 
This is referred to as multiple access interference (MAI). It has the effect of limiting the capacity of
cellular phone base stations, since interference may exceed acceptable levels, driving service
quality below acceptable levels, when there are too many users.

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al, "Multiuser Detection for 
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 
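
The principle that other users' signals can be used jointly to improve detection of any one user can be illustrated with a small sketch. The following is a textbook-style, single-stage parallel interference cancellation for two synchronous users; it is not taken from the attached materials. The spreading codes, amplitudes, and function names are assumptions chosen only to expose the near/far problem, and the receiver is assumed to know each user's amplitude.

```python
def correlate(signal, code):
    """Matched filter: normalized correlation of the signal with one user's code."""
    return sum(s * c for s, c in zip(signal, code)) / len(code)


def sign(x):
    return 1 if x >= 0 else -1


def matched_filter_detect(received, codes):
    """Conventional detection: each user is decided independently."""
    return [sign(correlate(received, code)) for code in codes]


def pic_detect(received, codes, amplitudes):
    """One stage of parallel interference cancellation (illustrative only)."""
    first_pass = matched_filter_detect(received, codes)
    decisions = []
    for k, code in enumerate(codes):
        cleaned = list(received)
        # Subtract the estimated contribution of every other user.
        for j, other in enumerate(codes):
            if j != k:
                for i, chip in enumerate(other):
                    cleaned[i] -= amplitudes[j] * first_pass[j] * chip
        decisions.append(sign(correlate(cleaned, code)))
    return decisions


# Hypothetical scenario: two non-orthogonal codes, user 1 five times stronger.
codes = [[1, 1, 1, 1], [1, 1, 1, -1]]
amplitudes = [1, 5]
bits = [1, -1]
received = [sum(a * b * c[i] for a, b, c in zip(amplitudes, bits, codes))
            for i in range(4)]

print(matched_filter_detect(received, codes))   # weak user 0 is decided wrongly
print(pic_detect(received, codes, amplitudes))  # both users recovered
```

In this scenario the conventional detector misreads the weak user because the strong user's signal leaks through the code cross-correlation; removing the strong user's estimated contribution first recovers both bits.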






EV 093 931 868 US 



An object of this invention is to provide improved methods and apparatus for wireless 
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 

A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time. 



A still further object of the invention is to provide such methods and apparatus as manage 
faults for high-availability. 
Summary of the Invention

These and other objects are met by the invention which provides, in one aspect, a 
communications computer, referred to as the "MCW-1" (among other terms) in the materials that 
follow, and methods of operation thereof. An overview of that system is provided in the section 
entitled "Communications Computer," beginning on page 5 hereof. A more complete 
understanding of its implementation may be attained by reference to the other attached materials. 

In view of those materials, aspects of the invention include, but are not limited to the 
following: 

• architecture and operation of a communications computer for a wireless
communications system, including a fully programmable computer inserted
into a base transceiver station (BTS) to support compute-intensive and/or highly
data-dependent functions such as adaptive processing and interference
cancellation.

These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 
Detailed Description of the Invention

See the attached materials on pages 5-11 hereof, providing description and block 
diagram of a preferred structure and operation of a communications computer for wireless 
applications according to the invention. 

The aforementioned materials pertain to improvements on the methods and apparatus
described in United States Provisional Application Serial No. 60/275,846, filed March 14, 2001,
entitled IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS and
United States Provisional Application Serial No. 60/289,600, filed May 7, 2001, entitled
IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS USING LONG-CODE
MULTI-USER DETECTION, the teachings of both of which are incorporated herein by
reference and copies of at least portions of which are attached hereto. Those copies bear the
U.S. Postal Service Express Mail label numbers of both prior filings, as well as that of this filing
(the latter being referred to as the "New Exp. Mail Label No.").
Background of the Invention

The invention pertains to wireless communications and, more particularly, to methods 
and apparatus for interference cancellation in code-division multiple access communications. 
The invention has application, by way of non-limiting example, in improving the capacity of 
cellular phone base stations. 

Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal time- 
slots, respectively. 

A limiting factor in CDMA communication and, particularly, in so-called direct sequence 
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g., 
multiple cellular phone users in the same geographic area using their phones at the same time. 
This is referred to as multiple access interference (MAI). It has the effect of limiting the capacity of
cellular phone base stations, since interference may exceed acceptable levels, driving service
quality below acceptable levels, when there are too many users.

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al, "Multiuser Detection for 
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 
An object of this invention is to provide improved methods and apparatus for wireless
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 

A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 

A still further object of the invention is to provide methods and apparatus for executing 
multi-user detection and related algorithms in real-time. 

A still further object of the invention is to provide such methods and apparatus as manage 
faults for high-availability. 






Summary of the Invention 

These and other objects are met by the invention which provides, in one aspect, a 
wireless communications system, referred to as the "MCW-1" (among other terms) in the
materials that follow, and methods of operation thereof. An overview of that system is provided 
in the document entitled "Software Architecture of the MCW-1 MUD Board," immediately 
following this Summary. A more complete understanding of its implementation may be attained 
by reference to the other attached materials. 

In view of those materials, aspects of the invention include, but are not limited to, the 
following: 

• methods and apparatus for long-code multi-user detection (MUD) in a wireless 
communications system. 

These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 
Detailed Description of the Invention

See the attached materials on pages 5-12 hereof, providing a block diagram of a
preferred algorithm for long code MUD which includes identification of (roughly) how many
GOPS are involved in each major function; a diagram showing interfaces between a long code
MUD processing card according to the invention and a modem, e.g., of the type provided by
Motorola (or another supplier of such components); and two block diagrams of the same
BASELINE 0 board hardware architecture at a top level identifying the processing nodes. The
attached diagram entitled "Long-code Mapping to Hardware" illustrates support of 64 users for
long code MUD and shows the parts of the long code MUD algorithm supported by each processing
node. The diagram entitled "Short-code Mapping to Hardware" illustrates support of 128 users
for short code MUD and shows the parts of the short code MUD algorithm that would be supported by
each processing node.

The aforementioned materials pertain to improvements on the methods and apparatus
described in United States Provisional Application Serial No. 60/275,846, filed March 14, 2001,
entitled IMPROVED WIRELESS COMMUNICATIONS SYSTEMS AND METHODS, the
teachings of which are incorporated herein by reference and a copy of which is attached hereto.
That copy bears the U.S. Postal Service Express Mail label number of the original filing, as
well as that of this filing (the latter being referred to as the "New Exp. Mail Label No.").
Background of the Invention

The invention pertains to wireless communications and, more particularly, to methods 
and apparatus for interference cancellation in code-division multiple access communications. 
The invention has application, by way of non-limiting example, in improving the capacity of 
cellular phone base stations. 

Code-division multiple access (CDMA) is used increasingly in wireless communications. 
It is a form of multiplexing communications, e.g., between cellular phones and base stations, 
based on distinct digital codes in the communication signals. This can be contrasted with other 
wireless protocols, such as frequency-division multiple access and time-division multiple access, 
in which multiplexing is based on the use of orthogonal frequency bands and orthogonal
time-slots, respectively.

A limiting factor in CDMA communication and, particularly, in so-called direct sequence
CDMA (DS-CDMA), is the interference between multiple simultaneous communications, e.g.,
multiple cellular phone users in the same geographic area using their phones at the same time.
This is referred to as multiple access interference (MAI). It has the effect of limiting the capacity of
cellular phone base stations, since interference may exceed acceptable levels, driving service
quality below acceptable levels, when there are too many users.

A technique known as multi-user detection (MUD) reduces multiple access interference 
and, as a consequence, increases base station capacity. MUD can reduce interference not only 
between multiple signals of like strength, but also that caused by users so close to the base 
station as to otherwise overpower signals from other users (the so-called near/far problem). 
MUD generally functions on the principle that signals from multiple simultaneous users can be 
jointly used to improve detection of the signal from any single user. Many forms of MUD are 
known; surveys are provided in Moshavi, "Multi-User Detection for DS-CDMA Systems," IEEE 
Communications Magazine (October, 1996) and Duel-Hallen et al, "Multiuser Detection for 
CDMA Systems," IEEE Personal Communications (April 1995). Though a promising solution 
to increasing the capacity of cellular phone base stations, MUD techniques are typically so 
computationally intensive as to limit practical application. 
An object of this invention is to provide improved methods and apparatus for wireless
communications. A related object is to provide such methods and apparatus for multi-user 
detection or interference cancellation in code-division multiple access communications. 



A further object of the invention is to provide such methods and apparatus as can be cost- 
effectively implemented and as require minimal changes in existing wireless communications 
infrastructure. 

A still further object of the invention is to provide methods and apparatus for executing
multi-user detection and related algorithms in real-time.

A still further object of the invention is to provide such methods and apparatus as manage
faults for high-availability.

Summary of the Invention

These and other objects are met by the invention which provides, in one aspect, a 
wireless communications system, referred to as the "MCW-1" (among other terms) in the 
materials that follow, and methods of operation thereof. An overview of that system is provided 
in the document entitled "Software Architecture of the MCW-1 MUD Board," immediately 
following this Summary. A more complete understanding of its implementation may be attained 
by reference to the other attached materials. 

In view of those materials, aspects of the invention include, but are not limited to, the
following:

• hardware and/or software architectures (and methods of operation thereof) for
multi-user detection in wireless communications systems and particularly, for
example, in a wireless communications base station;

• a hardware architecture (and methods of operation thereof) for multi-user
detection in wireless communications systems pairing each processing node with
NVRAM and watchdog PLD for fault management;

• methods and apparatus for connecting watchdog PLDs with an out-of-band
fault-management bus;

• methods and apparatus for use of an embedded host with the RACEway™
architecture of Mercury Computer Systems, Inc.;

• methods and apparatus for interfacing a digital signal processor to the
RACEway™ architecture;

• methods and apparatus for interfacing the RACEway™ architecture to a
programming port in a device for multi-user detection in wireless communications
systems;

• methods and apparatus for implementing a DMA Engine FPGA for use in
multi-user detection in a wireless communications system;

• methods and apparatus for implementing a hardware-based reset voter and stop
voter;

• methods and apparatus for scalable mapping of handset and BTS functions to
multiple processors;

• methods and apparatus for facilitating allocation and management of buffers for
interconnecting processors that implement the aforementioned mapping;

• methods and apparatus for implementing a hybrid operating system, e.g., with the
VxWorks operating system (of Wind River Systems, Inc.) on a host computer and
the MC/OS operating system on RACE®-based nodes (RACE and MC/OS are
trademarks of Mercury Computer Systems, Inc.);

• methods and apparatus for high-availability multi-user detection in wireless
communications systems, including (by way of non-limiting example) round-robin
fault testing, use of NVRAM to store fault symptoms, and use of a master
to diagnose faults from NVRAM contents;

• class library-based methods and apparatus for facilitating interprocessor
communications, by way of non-limiting example, in buffering for multi-user
detection in wireless communications systems;

• methods and apparatus for implementation of R-matrix, gamma-matrix and MPIC
computations on separate processors in a device for multi-user detection in
wireless communications systems;

• methods and apparatus for computing complementary R-matrix elements in
parallel using multiple processors in a device for multi-user detection in wireless
communications systems;

• methods and apparatus for depositing results of R-matrix calculations
contiguously in memory in a device for multi-user detection in wireless
communications systems;

• methods and apparatus for increasing the number of MPIC and R-matrix
calculations performed in cache in a device for multi-user detection in wireless
communications systems;

• methods and apparatus for performing a gamma-matrix calculation in FPGA in a
device for multi-user detection in wireless communications systems;

• methods and apparatus for equalizing load of R-matrix-element calculation
among multiple processors in a device for multi-user detection in wireless
communications systems; and

• methods and apparatus for use of AltiVec registers and instruction set in
performing MUD calculations in a wireless communications system.

These and other aspects of the invention (including utilization of the aforementioned 
methods and aspects for other than wireless communications and/or interference cancellation) 
are evident in the materials that follow. 









Detailed Description of the Invention 



(see attached materials) 
1 Purpose

The purpose of this document is to describe the software architecture of the
MCW-1 board. The MCW-1 application is a digital signal processing application
that performs interference cancellation for a cellular base station modem board.

The software project consists of 3 major parts:

• Support for the custom MCW-1 board being designed by the Wireless
Communications Group hardware department. This consists of porting the
existing host (VxWorks) and multicomputer (MC/OS) software to the board,
and adding code to support specialized features of the board such as LED
control, voltage monitoring, hardware watchdogs, etc.

• Increasing the MTBF of the system by addition of high availability software.
This software includes monitoring features such as watchdogs, fault
detection/repair algorithms, and remote software download.

• Implementation of the application software. This includes optimal
implementation of the MUD algorithms, as well as implementing degraded
versions of the algorithm that can be executed when some of the
computational hardware is unavailable due to failures.

Detailed information on the design of new software for the MCW-1 board can
be found in the appropriate functional design documents, which are listed in the
References section of this document.

2 Glossary

1. MTBF - Mean Time Between Failures
2. MUD - Multi User Detection. A class of algorithms to detect multiple
interference sources and remove those effects from the signal.
3. Multicomputer - a parallel computer which achieves its increase in performance
by having more than one CPU working on the application simultaneously.
4. VxWorks - a proprietary real time operating system sold by Wind River, Inc.

3 Application Execution Environment

3.1 Overview

The purpose of the MUD application is to input raw antenna data from the base
station modem card, detect sources of interference, produce a new stream of data
which has had interference removed, and then output the data to the modem card
for further processing.

Characteristics of this processing are that it must have low latency (< 300
microseconds), must deal with large amounts of data (> 110 million bytes of
data per second), and must be very reliable.

The Mercury computer system is well suited to this kind of signal processing,
exhibiting both very low latencies and high bandwidths.
The system hardware and software were not designed with high availability as
a goal, so reliability is in line with other standard computer systems designed for
commercial applications.

Input data flows from the Modem Motherboard, over the PCI bus, through the
PXB++ bridge, onto the fabric, through the crossbar, and into the memory of the
computing elements. Output data flows in the opposite direction. Some data will
also flow between the 8240 Host CPU and the compute elements, via a similar
pathway, i.e. from the PCI bus through the PXB++ and thus onto the fabric.

Although the software tries to treat the system as if the hardware were
symmetric, as can be seen in the following figure, the host 8240 CPU is attached
via the PCI bus, not directly to the fabric.

Figure 1



Operating System 

MC/OS was selected as the operating system for the MCW-1 board because it 
provides the low latencies and high I/O and IPC bandwidths required for these 
sorts of algorithms, and also because it already provides support for most of the 
hardware being incorporated on the MCW-1 board. 

The MUD application can be kept as portable as possible by minimizing the 
use of non-POSIX MC/OS system calls, and encapsulating calls into proprietary 
MC/OS interfaces such as DX. 

MC/OS requires the presence of a host computer system, which in this case
will be a Motorola 8240 PowerPC processor running the VxWorks operating
system.
Figure 2



The MC/OS DX subsystem will be used for IPC within the application. This 
API provides low overhead, low latency access to the Mercury DMA engines, 
which in turn provide high bandwidth transfers of data. DX will be used to move 
data between the G4 compute elements during parallel processing, and also will 
be used to move data between the MC/OS compute elements, the VxWorks host 
computer, and the motherboard modem card. 



Input / Output between the MUD card and the motherboard modem card takes 
place by moving data between the Race++ Fabric and the PCI bus via the PXB++ 
bridge. The application will use DX to initialize the PXB++ bridge, and to cause 
input/output data to move as if it were regular DX IPC traffic. 

Discussions with the customer need to take place in order to determine exactly 
how data flows over the PCI bus. For instance, it is currently unclear who will 
initiate data transfers, and how the initiator will know which PCI addresses should 
be involved in the transfer. A number of meetings with the customer are required 
to resolve these issues. 
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3.5 


High Availability 
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The approach to high availability on the MCW-1 card is to do most of the high 
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availability processing at a time when the application is not running. Specifically, 




130 




faults are handled by rebooting the system (fairly quickly). When the system 




131 




comes up, the application can determine which processing resources are available, 




132 




and it is up to the application to determine how to map its processing needs onto 




133 




the available resources. 




134 




This approach to high availability means that there are short interruptions in 




135 




service, but that the application does not need to know how to continue execution 




136 




across faults. For instance, the application can make the assumption that the 




137 




hardware configuration will not change without the system first rebooting. 




138 




If the application has state which needs to be preserved across reboots, the 




139 




application is responsible for checkpointing the data on a regular basis. The 




140 




system software will provide an API to a portion of the non-volatile RAM for this 




141 




purpose. It should be noted that the non-volatile RAM is quite small, and that 




142 




storage of more than a few hundred bytes of data will require another mechanism 


9 


143 




to be put in place. 
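
The checkpointing responsibility described above might look something like the following sketch. The NVRAM API is not specified in this document, so everything here is hypothetical: the 256-byte region size, the record layout (magic, length, CRC-32, payload), and the function names `save_checkpoint` and `load_checkpoint`. A plain `bytearray` stands in for the non-volatile RAM.

```python
import struct
import zlib

NVRAM_SIZE = 256                    # assumed size of the application's NVRAM region
HEADER = struct.Struct("<4sHI")     # magic, payload length, CRC-32 of payload
MAGIC = b"CKPT"                     # hypothetical record marker


def save_checkpoint(nvram: bytearray, payload: bytes) -> None:
    """Write a small state record; reject anything that does not fit."""
    total = HEADER.size + len(payload)
    if total > len(nvram):
        raise ValueError("checkpoint does not fit in the NVRAM region")
    header = HEADER.pack(MAGIC, len(payload), zlib.crc32(payload))
    nvram[:total] = header + payload


def load_checkpoint(nvram: bytearray):
    """Return the saved payload, or None if no valid record is present."""
    magic, length, crc = HEADER.unpack_from(nvram, 0)
    if magic != MAGIC or HEADER.size + length > len(nvram):
        return None
    payload = bytes(nvram[HEADER.size:HEADER.size + length])
    if zlib.crc32(payload) != crc:
        return None                 # corrupted or partially written record
    return payload


nvram = bytearray(NVRAM_SIZE)       # stand-in for the real non-volatile RAM
save_checkpoint(nvram, b"active-users=42")
print(load_checkpoint(nvram))
```

The checksum lets the application distinguish a valid checkpoint from an empty or corrupted region after a reboot, and the size check reflects the constraint that only a few hundred bytes are available.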


4 Operating System Environment

4.1 Overview




Mercury Computer Systems, Inc. has historically had the concept of a host
computer system. This dates back to the days when Mercury produced array
processors that were attached to customers' mainframe computers. The evolution
of Mercury multicomputers has left a vestigial host that often performs little more
service than as a bootstrap device for the multicomputer.

The host computer system survives in the MCW-1 design primarily as a way to
reduce schedule risk. The existence of a host computer system is assumed in so
many ways by the existing Mercury software that it would add significant
schedule risk to attempt to remove this assumption in the MCW-1 timeframe.




In the MCW-1 board, the host system performs the following functions:

• It configures the Compute Elements, Fabric, and Bridges
• It loads executable code into the Compute Elements
• It serves as a bridge to the TCP/IP internetwork
• It serves as a file system daemon
• It runs some of the application software
• It manages some of the specialized high availability hardware




4.2 Bootstrap




The host computer system is based on a Motorola 8240 PowerPC processor on
the MCW-1 board. The 8240 is attached to an amount of linear flash memory.
This flash memory serves several purposes.

The first purpose the flash memory serves is as a source of instructions to
execute when the 8240 comes out of reset. Linear flash is flash which can be
addressed as if it were normal RAM. Flash memories can also be organized to look
like disk controllers; however, in that configuration they require a disk driver to
provide access to the flash memory. Although such an organization has several
benefits, such as automatic reallocation of bad flash cells and write wear leveling,
it is not appropriate for initial bootstrap.

The flash memory also serves as a file system for the host (see Section 4.6),
and as a place to store board permanent information (such as a serial number).
Refer to the function design specification (TBS) for more details on how flash
memory is used.

When the 8240 first comes out of reset, memory is not turned on. Since high
level languages such as C assume some memory is present (for a stack, for
instance), the initial bootstrap code must be coded in assembler. This assembler
bootstrap should only be a few hundred lines of code, sufficient to configure the
memory controller, initialize memory, and initialize the configuration of the 8240
internal registers.

After the assembler bootstrap has finished execution, control is passed to the
MCW-1 H.A. code (which is also contained in boot flash memory). The purpose
of the H.A. code is to attempt to configure the fabric, and load the compute
element CPUs with H.A. code. Once this is complete, all the processors
participate in the H.A. algorithm. The output of the algorithm is a configuration
table which details which hardware is operational and which hardware is not. This
is an input to the next stage of bootstrap, the Multicomputer Configuration.

4.3 Multicomputer Configuration

MC/OS expects the host computer system to configure the multicomputer. The
configmc program reads a textual description of the computer system
configuration, and produces a series of binary data structures that describe the
computer system configuration. These data structures are used in MC/OS to
describe the routing and configuration of the multicomputer.

The MCW-1 board will use almost exactly the same sequence to configure the
multicomputer. The major difference is that MC/OS expects configurations to be
totally static, whereas the MCW-1 configuration will need to change dynamically
as faulty hardware causes various resources to be unavailable for use.

There are currently two proposals being considered for how this dynamic
reconfiguration takes place.

The first proposal is that the binary data structures produced by configmc are
modified to include flags that indicate whether a piece of hardware is usable or
not. A modification to MC/OS would prevent it from using hardware marked as
broken. The risk here is that the modifications to MC/OS may be non-trivial. The
benefit may be faster reboot times.

The second proposal is that the output of the H.A. algorithm is used to produce
a new configuration file input to configmc, the configmc execution is repeated
with the new file, and MC/OS is configured and loaded with no knowledge of the
broken hardware whatsoever. This proposal has the added benefit that configmc
may be able to calculate optimal routing tables in the face of failed
hardware, minimizing the performance impact of the failure on the remaining
components. This proposal provides risk reduction given that MC/OS changes
would not be required.
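
The second proposal can be sketched as a simple filtering step between the H.A. algorithm and configmc. The actual configmc input format and the layout of the H.A. configuration table are not given here, so the sketch assumes a hypothetical format: one hardware item per line, with the item's name as the first token, and a table mapping names to an operational flag. The function name `regenerate_config` is likewise illustrative.

```python
def regenerate_config(config_text: str, operational: dict) -> str:
    """Drop lines describing hardware that the H.A. table marks as faulty.

    Assumed (hypothetical) format: one hardware item per line, name first.
    Blank lines and '#' comments are kept. Items missing from the table are
    treated as faulty, which errs on the side of not using unknown hardware.
    """
    kept = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            kept.append(line)
            continue
        name = stripped.split()[0]
        if operational.get(name, False):
            kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical board description and H.A. result table.
original = "# board\nce0 g4\nce1 g4\nce2 g4\nbridge0 pxb\n"
ha_table = {"ce0": True, "ce1": False, "ce2": True, "bridge0": True}

print(regenerate_config(original, ha_table))
```

The filtered text would then be handed to configmc as if the faulty hardware had never existed, which is why this proposal requires no MC/OS changes.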
4.4 Multicomputer Loading

After the host computer has configured the multicomputer, the runmc program
loads the functional compute elements with a copy of MC/OS. The only change
required for the MCW-1 board is for the loading process to examine which
hardware may be offline because it is faulty, and take this into account when
determining which compute elements need to be loaded.

4.5 TCP/IP Bridge

We believe that the customer is likely to require access to the MCW-1 board
from a TCP/IP network. MC/OS nodes do not contain a TCP/IP stack; therefore
the host computer system acts as a connection to the TCP/IP network. The
VxWorks operating system contains a fully functional TCP/IP stack. All currently
envisioned daemons that need access to the TCP/IP network will run on the host
processor. Should the need arise for compute elements to access network
resources, the host computer would have to act as a proxy, exchanging
information with the compute element utilizing DX transfers, and then making the
appropriate TCP/IP calls on behalf of the compute element.



4.6 File System

The host computer system needs a file system to store configuration files,
executable programs, and MC/OS images. Rotating disks have insufficient MTBF;
therefore flash memory will be utilized. Rather than adding a flash memory
separate from the host computer boot flash, the same flash is utilized for both
bootstrap purposes and for holding file system data. A commercial flash file
system will be purchased and ported which provides DOS file system semantics
as well as write wear leveling. Wear leveling attempts to spread the number of
writes evenly across the sectors of flash memory, as flash memory can only be
written a finite number of times before it is worn out. Modern flash devices can be
written around 100,000 times before they are worn out.
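
The idea behind wear leveling can be sketched in a few lines. This is a simplified illustration, not the algorithm of the commercial flash file system mentioned above; the per-sector counters and the function names are assumptions, and the limit simply restates the approximate 100,000-write figure.

```python
WRITE_LIMIT = 100_000  # approximate write endurance per sector, from the text


def pick_sector(write_counts):
    """Return the index of the least-written sector (lowest index wins ties)."""
    return min(range(len(write_counts)), key=lambda i: write_counts[i])


def record_write(write_counts):
    """Choose a sector for the next write and update its counter."""
    sector = pick_sector(write_counts)
    if write_counts[sector] >= WRITE_LIMIT:
        raise RuntimeError("all sectors have reached their write limit")
    write_counts[sector] += 1
    return sector


counts = [0, 0, 0, 0]
for _ in range(8):
    record_write(counts)
print(counts)
```

Always writing to the least-worn sector keeps the counters within one write of each other, so no single sector wears out early.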

4.7 Remote Software Upgrade

The current design of the MCW-1 board assumes that the customer will want
to update system and application code in the field, via network. There are two
portions of code which need to be updated: the bootstrap code which is executed
by the 8240 processor when it comes out of reset, and the rest of the code which
resides on the flash file system as files.

When code is initially downloaded to the MCW-1, it is written as a group of
files within a directory in the flash file system. A single top level file keeps track
of which directory tree is used to boot the system. This file continues to point at
the existing directory tree until a download of new software is successfully
completed. When a download has been completed and verified, the top-level file
is updated to point to the new directory tree, the boot flash is rewritten, and the
system can be rebooted.
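
The pointer-file scheme described above can be sketched as follows. This is a minimal illustration on a general-purpose file system: the file name `current`, the directory names and the `verify` callback are hypothetical, and it assumes an atomic rename, which a flash file system with DOS semantics may not provide.

```python
import os
import tempfile

POINTER = "current"  # hypothetical name of the top-level pointer file


def read_active(root):
    """Return the name of the directory tree currently used to boot."""
    with open(os.path.join(root, POINTER)) as f:
        return f.read().strip()


def activate(root, new_tree, verify):
    """Point the top-level file at new_tree only if verification succeeds."""
    if not verify(os.path.join(root, new_tree)):
        return False  # pointer still names the existing tree
    fd, tmp = tempfile.mkstemp(dir=root)
    with os.fdopen(fd, "w") as f:
        f.write(new_tree + "\n")
        f.flush()
        os.fsync(f.fileno())
    # Swap the pointer in one step so a reboot sees either old or new, never a mix.
    os.replace(tmp, os.path.join(root, POINTER))
    return True
```

Because the pointer is switched only after verification, an interrupted or corrupt download leaves the previous directory tree bootable.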

A possible problem in multi-board systems is how to deal with different
versions of released software on different boards. For instance, if board 1 has
revision 1.0 of the software distribution, and board 2 has revision 1.1 of the
software distribution, will the two versions work together, or will there be a way
to ensure that the same version of software is installed on all boards? This issue
does not occur on the MCW-1 because it is a single board solution; therefore this
issue can be addressed at a later time.

A commercial solution to remote software upgrade is available, and has been
ported to VxWorks. It is our intent to port this code at a future date.



5 High Availability

5.1 Goals 

The goal of the high availability features of the MCW-1 is to increase the 
MTBF of the system as much as possible with little or no increase in cost to the 
board. The requirement for minimal cost increase rules out such common 
approaches as hot or cold standby, replicated hardware, etc. 

It is not a goal to provide uninterrupted computing during hardware or software 
failures, nor is it a goal to provide fault tolerance. 

5.2 Fault Detection & Isolation

Fault detection is performed by having each CPU in the system gather as much
information as possible about what it observed during a fault, and then comparing
the information in order to detect which components could be the common cause of
the symptoms. In some cases, it may take multiple faults before the algorithm can
detect which component is at fault. The requirement not to add expensive
hardware for fault detection means that in many cases the algorithm will not be
able to determine which component is at fault.
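
The comparison step can be pictured as intersecting the sets of components that each report implicates. The sketch below is an assumption about how such a comparison might look, not the algorithm of the functional design specification; the component names are hypothetical.

```python
def isolate(observations):
    """Intersect the suspect sets reported by each CPU or each fault.

    observations -- iterable of sets of component names that could explain
                    what one CPU observed (hypothetical representation)
    Returns the components consistent with every observation.
    """
    suspects = None
    for observed in observations:
        suspects = set(observed) if suspects is None else suspects & set(observed)
    return suspects or set()


# One report leaves two candidates; a second report narrows it to one.
print(isolate([{"ce1", "xbar"}]))
print(isolate([{"ce1", "xbar"}, {"ce1", "pxb"}]))
```

A result with more than one element corresponds to the case described above in which the algorithm cannot yet determine the faulty component and further faults are needed.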

The MCW-1 board has many single points of failure. Specifically, everything 
on the board is a single point of failure except for the compute elements. This 
means that the only hard failures that can be configured out are failures in the 
compute elements. However, many failures are transient or soft, and these can be 
recovered from with a reboot cycle. Therefore, we expect the high availability 
features to have a positive effect on the MTBF of the card. 

More detailed information is available in the functional design specification (1).

5.3 Degraded Application

In the case of hard failures of a compute element, the application will have to
execute with reduced demand for computing resources. There are several
strategies possible for the MUD algorithm to decrease computing demands, such
as working with a smaller number of interference sources, or performing a less
complete job of interference cancellation.

We expect the computing requirements of the algorithm to be high enough that
failure of more than a single compute element will cause the board to be
inoperative. Therefore, the MCW-1 application only needs to handle two
configurations: all compute elements functional and one compute element
unavailable. We believe that a small amount of startup code can map the
application onto the two possible configurations. Note that the single crossbar
means that there are no issues as to which processes need to go on which
processors: the bandwidth and latencies for any node to any other node are
identical on the MCW-1. This will not be true of larger systems in the future, and
we will eventually need a way to map computing and I/O requirements onto
arbitrary hardware configurations.
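
The startup mapping described above might look like the following sketch. It assumes four compute elements and a simple round-robin placement, which is acceptable only because all nodes are equidistant on the single crossbar; the task names and function names are hypothetical.

```python
TOTAL_CES = 4  # assumed number of compute elements on the board


def select_configuration(healthy):
    """Classify the board state from the set of healthy compute elements."""
    if len(healthy) == TOTAL_CES:
        return "full"
    if len(healthy) == TOTAL_CES - 1:
        return "degraded"
    return "inoperative"


def place(tasks, healthy):
    """Assign tasks round-robin; placement is arbitrary on a single crossbar."""
    nodes = sorted(healthy)
    return {task: nodes[i % len(nodes)] for i, task in enumerate(tasks)}


print(select_configuration({0, 1, 3}))
print(place(["a", "b", "c", "d"], {0, 1, 3}))
```

On a larger system with several crossbars, `place` would have to account for bandwidth and latency between nodes, which is the open problem the text points to.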

5.4 Remote Software Upgrade

Downtime due to the updating of software is counted against the availability of
a computer system, and therefore a remote reload of software is a necessity. The
MCW-1 is capable of downloading new software during normal operation. The
reboot strategy means that the downtime due to starting up new software is only a
few seconds.

Referenced Documents

1. "MC/OS High Availability Functional Design Specification", Yevgeniy
Tarashchanskiy, 17 April, 2000.




Mercury Computer Systems, Inc. COMPANY CONFIDENTIAL 



Mercury Computer Systems 

Wireless Communications 
Hardware Engineering 



MCW-1a Functional Specification

Memorandum #SRI-1 
31 January 2001 



Revision 3.00 



This document was created using MS Word 97 and is located at 
TBD 

Notice: If you are not viewing this document in electronic form at the above full path-name, it is not guaranteed to be 
the latest revision. 

MCW-1a Functional Specification Created on 2/2/01




1 REVISION HISTORY
2 REFERENCE DOCUMENTS
3 MERCURY PART NUMBER
4 FUNCTIONAL DESCRIPTION
4.1 OVERVIEW
4.2 FEATURES
4.3 CONFIGURATION OPTIONS
4.3.1 CPU Options
4.3.2 SDRAM Options
4.3.3 FLASH Memory Options
4.3.4 Ethernet Options
4.4 REQUIREMENTS
4.4.1 Mechanical Form Factor
4.4.2 Power Requirements
4.4.3 Electrical Interface
4.4.4 Functional
4.5 COMPATIBILITY
4.6 PERFORMANCE
4.7 DETAILED DESCRIPTION
4.7.1 Modem Board Interface
4.7.2 Board Resets
4.7.3 Watchdog Monitor
4.7.4 Operating Frequency
4.7.4.1 Clock Margining
4.7.5 Serial Configuration EEPROM
4.7.5.1 PXB++ FPGA Serial EEPROM
4.7.5.2 XBAR++ ASIC Serial EEPROM
4.7.6 RACEway++ Interconnect
4.7.7 Local PCI I/O Bus
4.7.7.1 PXB++ Program EEPROM
4.7.8 Ethernet Interface
4.7.9 MPC7400 or Nitro Computer Nodes (CNs)
4.7.9.1 Processor
4.7.9.2 MPC7400 L2 Cache
4.7.9.3 PCE133 ASIC
4.7.9.4 Address Map
4.7.9.5 Interrupt
4.7.9.6 PCE133 DIAG Bits
4.7.9.7 MPC7400 Reset
4.7.9.8 Boot Procedures
4.7.9.9 MPC7400 CN SDRAM
4.7.9.10 MPC7400 Non-Volatile RAM
4.7.10 MPC8240 Host Controller
4.7.10.1 Address Map
4.7.10.2 Register Description
4.7.10.3 Interrupt
4.7.10.4 MPC8240 Reset
4.7.10.5 Boot Procedure
4.7.11 Bulk FLASH Memory
4.7.12 Real Time Clock
4.7.13 Nonvolatile Memory
4.7.14 Fault Status and Control Registers
4.7.15 Majority Voter
4.7.16 Discrete I/O
4.7.17 Interrupt Controller
4.7.17.1 Interrupt Controller Operation




4.7.18 Configuration Jumpers
4.7.19 LEDs
4.7.20 Power Supply
4.7.20.1 MPC7400 Core Power Supply
4.7.20.2 Main 3.3V Power Supply
4.7.20.3 Core and I/O 2.5V Power Supply
4.7.20.4 ASICs Power Supplies Tolerance Requirements
4.7.20.5 Power Supply Voltage Sequencing
4.7.20.6 Power Supply Monitoring
5 ELECTRICAL INTERFACE
5.1.1 Power Consumption
5.1.2 I/O
5.1.2.1 Over-the-Top RACEway++ Interlink
5.1.2.2 PCI 32-Bit Modem Connector
5.1.2.3 Ethernet 10/100BT
6 MECHANICAL
6.1.1 Physical Outline
6.1.3 Physical Constraint
7 ENVIRONMENTAL
7.1.1 Temperature & Air Flow
7.1.2 Humidity
7.1.3 Operating Altitude
7.1.4 Shock & Vibration
7.1.5 Compliance
7.1.6 Reliability
8 SWITCHES & JUMPERS
8.3 J17 Jumper
8.4 J18 Jumper
8.5 J19 Jumper
8.6 J20 Jumper
8.7 J21 Jumper
8.8 J22 Jumper
9 TESTABILITY
9.1 JTAG TEST SCAN
10 Appendix A: RACEway++ Over-the-Top Connector Pinout
11 Appendix B: Modem Board Connector Pinout
12 Appendix C: Ethernet Connector Pinout
13 Appendix D: JTAG Connector Pinout
14 Appendix E: MCW-1A Part Cost
15 Appendix H: Design Notes
15.1 MPC7400 and Nitro Bus Signaling Voltage Support
15.2 Bypass Capacitors Selection
15.3 Tantalum Capacitors Selection

Table 1. Route Codes for MCW-1A Board XBAR
Table 2. Test Clock Connector
Table 3. Master Address Map
Table 4. Boot FLASH Address Map
Table 5. Slave Address Map
Table 6. MPC8240 Address Map B
Table 7. Port X Address Map
Table 8. Fault Status Register Format
Table 9. Fault Control Register Definition
Table 10. Discrete Output Words
Table 11. Discrete Input Words
Table 12. Interrupt Controller Inputs
Table 13. MCW-1 CN Power Consumption
Table 14. MCW-1 Power Consumption
Table 15. RACEway++ F1 Cable Mode Connector Pinout J-27
Table 16. RACEway++ F2 Cable Mode Connector Pinout J-28
Table 17. Modem Board Connector Pin Assignments
Table 18. Ethernet JS Connector Pin Assignments
Table 19. JTAG Jx Connectors Pin Assignments
Table 20. MCW-1 CN @ 400 MHz Part Cost
Table 21. MCW-1 Part Cost

Figure 1. MCW-1A Block Diagram
Figure 2. MCW-1A Board-Level Topology
Figure 3. Hard Reset Functional Block diagram
Figure 4. Example watchdog service sequences
Figure 5. Ideal Power Supply Sequencing
Figure 6. Real Power Supply Sequencing
Figure 7. Voltage Sequencing Circuits
Figure 8. MCW-1 Outline




1 REVISION HISTORY

Revision 0.0 - 3/17/00 Steven Imperiali
Initial Entry

Revision 0.01 - 4/25/00 Steven Imperiali
Minor corrections, filled in missing sections.

Revision 0.1 - 5/5/00 Steven Imperiali
Incorporated review comments.

Revision 0.2 - 5/8/00 Steven Imperiali
Incorporated review comments.
Removed reference to RapidIO/RACE++ bridge

Revision 0.21 - 5/16/00 Steven Imperiali
Incorporated review comments.
Modified MPC8240 memory map

Revision 0.22 - 5/26/00 Steven Imperiali
Modified MPC8240 Memory Map

Revision 1.00 - 7/24/00 Steven Imperiali
Modified MPC8240 Memory Map
Updated memo with current design status

Revision 2.01 - 11/01/00 Steven Imperiali
Modified power supply ramp requirements

Revision 2.02 - 11/15/00 Steven Imperiali
Modified interrupt controller

Revision 2.03 - 1/26/01 Steven Imperiali
Minor documentation corrections

Revision 3.00 - 1/31/01 Steven Imperiali
Modified memo to reflect MCW-1a modules



2 REFERENCE DOCUMENTS

1. American National Standard for RACEway Interlink (ANSI/VITA 5-1994)
2. PCI Rev 2.2 Local Bus Specification
3. PCE133 ASIC Hardware Specification
4. XBAR++ Function Specification
5. PXB++ PCI Bridge Functional Specification
6. PowerPC 7400 PPC Microprocessor Hardware Specification
7. Flash Memory Specification p/n TBD
8. MCW-1 Product Definition Document (PDD) vTBD
9. Technical brief of Mercury Computer Systems RACE++ series topologies
10. MPC8240 Users Manual (MPC8240UM/D 07/1999 Rev. 0)






3 MERCURY PART NUMBER

The board identifier name is MCW-1a and the Mercury part number is 560549.

4 FUNCTIONAL DESCRIPTION
4.1 OVERVIEW

The MCW-1a is designed to be an algorithm processing daughter card utilizing the MPC7400 PPC, MPC8240,
PCE133 ASIC, XBAR++ ASIC, and PXB++ FPGA. The MCW-1 mates with a Motorola base station modem board.
The MCW-1a can provide additional connectivity between processing elements in different sector slots utilizing
over-the-top RACEway++ cables. It is a Motorola form factor card with four computational nodes and one host node.
The computational nodes (CNs) are based on the latest MPC7400 PPC microprocessor and the host is an MPC8240.
The MCW-1a can provide one Ethernet 10/100 BT port on the front panel. A 32-bit, 66 MHz PCI interface provides
the interface to the Motorola board.

The MCW-1a block diagram is shown in Figure 1.
Figure 2 shows the MCW-1a system topology. Table 1 gives the proposed route codes for the board.




Table 1. Route Codes for MCW-1a Board XBAR

Route Code    Destination for Virtual Ports    Physical XBAR 1 Ports
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

4.2 FEATURES

• Custom size daughter card
• Master PCI 32-bit @ 66 MHz compliant with Rev 2.2 PCI local bus spec.
    PCI write peak performance is 240 MB/sec.
    PCI read peak performance is 220 MB/sec.
• Single IEEE 802.3 compliant Ethernet 10BASE-T/100BASE-T
• Four computation nodes (CNs) based on MPC7400 PPC running @ 400 MHz.
    1 MB L2 cache per CN @ 200 MHz to 266 MHz.
    128 MB SDRAM with ECC per CN @ 133 MHz.
    Hardware based watchdog monitor.
    One PCE133 ASIC per CN.
• Two over-the-top, 66 MHz RACEway++ interlink ports configured in cable mode.
• PCI interface 32-bit @ 66 MHz.
• RACEway++ crossbar to connect nodes.
• PXB++ 64-bit @ 33 MHz PCI bus.
• Non-transparent 64-bit/33 MHz to 32-bit/66 MHz PCI bridge.
• 200 MHz MPC8240 PowerPC processor.
    32-bit 33 MHz PCI bus.
    100 MHz, 64 Mbytes SDRAM.
• Bulk FLASH interface.
    Linear address mode.
    32 banks of 1 Mbyte.
• LEDs.
• 8 Kbytes non-volatile SRAM.
• Real time clock.
• Compute node fault isolation control.
• JTAG test port.

4.3 CONFIGURATION OPTIONS

4.3.1 CPU Options

• MPC7400 @ 400 MHz.
• MPC7410 @ 400 MHz.

4.3.2 SDRAM Options

• 128 MB SDRAM @ 133 MHz with ECC.

4.3.3 FLASH Memory Options

• 16 MB FLASH memory.
• 32 MB FLASH memory.

4.3.4 Ethernet Options

• No Ethernet.
• Single Ethernet.

4.4 REQUIREMENTS

4.4.1 Mechanical Form Factor

The MCW-1a form factor conforms to TBD Motorola mechanical requirements.

4.4.2 Power Requirements

The MCW-1a requires +5.0 volts from the modem board. The +1.5 V to +2.1 V core voltage required by the
MPC7400 is converted from +5.0 V on the board. There are two core supplies used to power the four CPU cores.
The 2.5 V voltage required is converted from +5.0 V by an onboard power supply. The 3.3 V voltage required is also
converted from +5.0 V by an onboard power supply. The MCW-1a estimated typical power dissipation is 50 watts @
5.0 V.

4.4.3 Electrical Interface

The MCW-1a provides a PCI 32-bit, 66 MHz interface to the Motorola modem board via an 80-pin connector.

The MCW-1a provides two over-the-top RACEway++ ports via two connectors located on the front panel.

The MCW-1a provides a single Ethernet 10/100 BT interface available from one RJ-45 connector. The Ethernet
interface is provided by a third party Ethernet-to-PCI interface controller chip that is bridged to the crossbar
RACEway++ port by means of a PXB++ FPGA (See Figure 2).

4.4.4 Functional

1. Shall have the Main SDRAM memory at 133 MHz or greater.
2. Shall have a 1 Mbyte L2 Cache at 200 MHz or greater.
3. All CE nodes shall have 128 Mbyte of SDRAM.
4. Host node shall have at least 32 Mbytes of nonvolatile memory.

Form factor requirements:

5. Shall be a daughter card that is % of a Motorola proprietary form factor modem payload card sized 11" by 14". On
20mm centers board to board. {actual shape, dimensions etc TBD via drawings from Motorola.}
6. Shall be electrically a PMC module, TBD from further discussions with customer.
7. Shall use P1, P2 for 32/66 MHz PCI bus.
8. Shall have a maximum heat dissipation of 50 W.

System requirements:

9. A minimum of 105 Mbyte/sec from the modem payload module to the MCW-1a card shall be provided through the
PCI interface.
10. From the MCW-1a card to Motorola Modem Payload module output bandwidth shall be at least 200 kbyte/sec,
concurrent with the 105 Mbyte/sec input.
11. The system shall have a bandwidth of at least 250 Mbyte/sec between CEs, e.g. RACE++ at 66 MHz, as a minimum.
12. Shall have non-volatile memory, for at least 32 Mbytes of data.
13. Shall support software upgrade from remote locations.



4.5 COMPATIBILITY 

The MCW-1a board is a custom daughter card designed for the Motorola base station modem board.

4.6 PERFORMANCE

The PCI bus standard and the PXB++ FPGA limit the RACEway++ to the PCI performance. Peak transfers of 240
MB/sec are achievable between the PXB++, MPC8240 and the non-transparent PCI Bridge. (See Figure 1)
Data transfers of up to 266 MB/sec peak are supported for access from RACEway++ to/from the MPC7400 CE's local
SDRAM memory.

PCE133 ASIC-initiated DMA transfers run at optimum RACEway++ speeds approaching 266 MB/sec peak. A single DMA
command can transfer data between the CN's local SDRAM memory and RACEway++ in either direction. The DMA
engine formats transfers across RACEway++ optimally using packets up to 2048 bytes.

The operating clock frequency of the PCE133 ASIC, SDRAM, and MPC7400 processor bus is 133 MHz. Likewise, the
operating frequency for the RACEway++ is 66 MHz. The local PCI clock is used by the corresponding PXB++ FPGA
and does not exceed 33 MHz.

A separate 25 MHz oscillator is included on the MCW-1a for driving the Ethernet interface.

4.7 DETAILED DESCRIPTION 

The MCW-1a block diagram is shown in Figure 1.

4.7.1 Modem Board Interface

TBD (PCI 32-bit 66 MHz).
TBD PCI to PCI bridge stuff.
TBD Motorola requirements.



4.7.2 Board Resets

There are several sources of reset to the daughter card. A MAX823 voltage supervisor will generate a 200 ms
reset after VCC rises above 4.38 volts. When the MAX823 reset is deasserted, state machine logic will
monitor PCI_RESET_0. The state machine will continue driving RESET_0 until both the MAX823 and
PCI_RESET_0 are deasserted. Either reset will generate the signal RESET_0, which will reset the card into its
power-on state. RESET_0 will also generate the HRESET_0 and TRST signals to the five CPUs. HRESET_0
and TRST for each of the CPUs can also be generated by their JTAG ports, via JTAG_HRESET_0 and
JTAG_TRST respectively. The MPC8240 is capable of generating a reset request, a soft reset (C_SRESET_0)
to each CPU, a checkstop request, and a CE ASIC reset (CE_RESET_0) to each of the four CE ASICs. A
discrete from the 5 V powered reset PLD will generate the signal NPORESET_1 (not a power on reset). This
signal is fed into the MPC8240's discrete input word. The MPC8240 will read this signal as a logic low only if
it is coming out of reset due to either a power condition or an external reset from offboard. Each node, as well
as the MPC8240, may request a board level reset. These requests are majority voted, and the result,
RESETVOTE_0, will generate a board level reset.

Figure 3 shows the MCW-1a hard reset generation function.

Figure 3. Hard RESET Functional Block diagram
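
The majority vote on board-level reset requests can be modeled as below. This is a sketch under assumptions: it treats the four nodes and the MPC8240 as five equal voters and uses a strict majority, neither of which is confirmed in this section, and it uses True for "requesting a reset" rather than the hardware signal polarity.

```python
def reset_vote(requests):
    """Return True when a strict majority of voters request a board reset.

    requests -- one truthy/falsy value per voter (assumed: four nodes
                plus the MPC8240, five voters in total)
    """
    votes = sum(1 for r in requests if r)
    return votes * 2 > len(requests)


print(reset_vote([True, True, True, False, False]))
print(reset_vote([True, True, False, False, False]))
```

With five voters, a single misbehaving processor cannot force a board reset on its own, which is the usual reason for voting on such a request.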
4.7.3 Watchdog Monitor

There are five independent watchdog monitors on the MCW-1a card. Each processor node is responsible for
strobing its watchdog at least once every 20 msec (the initial window after a board level reset is 2 sec), but no
sooner than 500 usec after the previous strobe. Strobing the watchdog for the processing nodes is accomplished by
writing a zero/one sequence to the DIAG3 discrete coming from the PCE133 ASIC. The MPC8240's watchdog is
serviced by writing to the memory mapped discrete location FFFF_D027. A single write of any value will strobe the
watchdog. Upon power-on, the watchdogs come up in a failed state; once a valid strobe is issued, the watchdog will
be satisfied. If the CPU fails to service the watchdog within the valid window, the watchdog will fail. A watchdog
fault on a processing node will trigger an interrupt to the MPC8240. An MPC8240 watchdog fault will trigger a
reset to the board. The watchdog will then remain in a latched failed state until a CPU reset occurs followed by
a valid service sequence. Figure 4 shows valid service sequences of the watchdog.

[Figure 4: timing diagram of the Reset and Software service signals, showing the initial 2 second window after
reset, the 20 msec maximum interval between services, and the 500 usec minimum spacing between services.]

Watchdog fault conditions:
- A service within 500 usec of last service.
- No service within 20 msec of last service.

Figure 4. EXAMPLE WATCHDOG SERVICE SEQUENCES
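
The service window can be checked with a small model such as the one below. It is a sketch rather than a description of the watchdog hardware: the function name is hypothetical, gaps of exactly 500 usec or 20 msec are assumed to be valid, and the time after the last strobe is not examined.

```python
MIN_GAP_US = 500            # a service sooner than this is a fault
MAX_GAP_US = 20_000         # a service later than this is a fault
INITIAL_WINDOW_US = 2_000_000  # window for the first service after reset


def watchdog_ok(strobe_times_us):
    """Check a list of strobe times (microseconds since reset, ascending)."""
    if not strobe_times_us or strobe_times_us[0] > INITIAL_WINDOW_US:
        return False  # no service within the initial 2 second window
    for prev, cur in zip(strobe_times_us, strobe_times_us[1:]):
        gap = cur - prev
        if gap < MIN_GAP_US or gap > MAX_GAP_US:
            return False
    return True


print(watchdog_ok([1_000_000, 1_010_000, 1_025_000]))
print(watchdog_ok([1_000_000, 1_000_300]))
```

The lower bound is what distinguishes this windowed watchdog from a simple timeout: a processor stuck in a tight loop that strobes too often is detected as well as one that has stopped.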



4.7.4 Operating Frequency 

The MPC7400 bus runs at 133 MHz. The L2 cache bus of the MPC7400 runs at 200 MHz to 266 MHz. The SDRAMs 
run at 133 MHz. The RACEway++ interface runs at 66 MHz. The local PCI bus runs at 33 MHz and the off board PCI 
runs at 66MHz. The MPC8240's internal frequency is 200 MHz while its SDRAM interface is 100 MHz. 

4.7.4.1 Clock Margining 

This card has two crystal oscillators for the three clock domains present on the card. A 66 MHz oscillator serves the
RACEway++ interface and MPC7400 CNs. The 66 MHz frequency is divided in half to generate a 33 MHz signal for
the PCI interface. A second oscillator, 25 MHz, clocks the Ethernet and watchdog circuitry. Both the PCI and MPC
clocks are marginable. In order to provide clock margining, a 4-pin connector allows the test engineer to functionally
disable the onboard oscillator and replace it with a test frequency. The pinout of this connector is detailed in Table 2.

Table 2. Test Clock Connector

Pin    Signal
1      GND
2      /Test Clock
3      Test Clock
4      Test Clock Enable L



4.7.5 Serial Configuration EEPROM 

There are several serial EEPROMs used to load configuration to the CE ASICs, PXB++ and XBAR++ after reset. The
serial EEPROM functionality can be found in each ASIC's functional specification.

4.7.5.1 CE ASIC Serial EEPROM

The serial EEPROM can be read and programmed by means of the RACEway++ bus. It is programmed during
manufacture of the MCW-1a to contain configuration information for the CE ASIC. The serial EEPROM AT24C128 is
controlled from the CE ASIC. After reset, the CE ASIC automatically reads the first location from the serial EEPROM.
Refer to the CE ASIC functional specification, reference 3, for information on reading and writing this device.

4.7.5.2 PXB++ FPGA Serial EEPROM

The serial EEPROM can be read and programmed by means of the PCI bus or the RACEway++ bus. It is programmed
during manufacture of the MCW-1a to contain configuration information for the PXB++. The serial EEPROM AT24C128
device is 128K bits and is controlled from the PXB++. After reset, the PXB++ automatically reads 8 KB from the serial
EEPROM and initializes the PXB++ internal registers. Refer to the PXB++ FPGA functional specification, reference 5,
for information on reading and writing this device.

4.7.5.3 XBAR++ ASIC Serial EEPROM

The serial EEPROM can be read and programmed by means of the RACEway++ bus. It is programmed during
manufacture of the MCW-1a to contain configuration information for the XBAR++. The serial EEPROM AT24C128 is
controlled from the XBAR++ ASIC. After reset, the XBAR++ ASIC automatically reads from the serial EEPROM and
initializes the XBAR++ internal registers. Refer to the XBAR++ ASIC functional specification, reference 4, for
information on reading and writing this device.

4.7.5.3.1 Register Description

Reference 4 describes the registers of the XBAR++ ASIC.

4.7.6 RACEway++ Interconnect 

Communication between all processing and I/O elements on the system card is provided by a Mercury eight-port
crossbar XBAR++ ASIC. The XBAR++ provides up to three simultaneous 266 MB/sec peak throughput data paths
between elements for a total peak throughput of 798 MB/sec. Three crossbar ports connect to the RapidIO Bridge
FPGA. Each MPC7400 CN uses one crossbar port. The Ethernet and MPC8240 interface to a crossbar port through the
PXB++. (See Figure 2.) Reference 4 describes the operation and registers of the XBAR++ ASIC.

4.7.7 Local PCI I/O Bus 

The PXB++ FPGA provides the local PCI I/O bus. This bus is accessible by means of the RACEway++ from the
processing nodes. All resources on this bus are initialized and controlled by the MPC8240. This bus provides access to
an Ethernet controller, PCI to PCI transparent bridge and the MPC8240 host controller. Transfers from devices on this
local PCI bus to and from devices on the RACEway++ can achieve 240 MB/sec for writes and 220 MB/sec for reads.
These rates assume block transfers of reasonable size.

4.7.7.1 PXB++ Program EEPROM

The PXB++ FPGA is programmed by an XC18V04 configuration EEPROM running in parallel mode. Configuration
initiates when a power-on or board level reset occurs. The 16.6 MHz configuration clock is generated by dividing the
onboard 33 MHz clock. The configuration EEPROM itself is onboard programmable through the JTAG scan chain.

4.7.7.1.1 Register Description

Reference 5 describes the registers of the PXB++ FPGA.



4.7.8 Ethernet Interface 

The PCI-to-Ethernet interface uses the AM79C973 PCnet-FAST III single chip 10/100 Mbps Ethernet controller. This
device is equipped with a built in physical layer interface to achieve a minimal parts count Ethernet interface. A 25 MHz
oscillator provides the proper clock frequency to the Ethernet chip. The PCI interrupt from the Ethernet chip is wired to
the MPC8240's external interrupt controller.

4.7.9 MPC7400 or Nitro Computer Nodes (CNs)

The board contains four MPC7400 CNs. Each MPC CN uses a PCE133 ASIC to interface the CPU to RACEway++. The
PCE133 ASIC provides all the standard features of a CN, such as a DMA engine, mailbox interrupts, timers,
RACEway++ page mapping registers, SDRAM interface, and so on. Local memory for each CN consists of 32, 64, or
128 MB SDRAM, and L2 cache SRAM. Each CN also has a nonvolatile SRAM and watchdog monitor. The CPU bus is
64-bit data, 32-bit address, and operates synchronously at 133 MHz.



4.7.9.1 Processor

The MCW-1a card is designed to use either the 400 MHz MPC7400 or the 400 MHz Nitro processors. The processor is
packaged in a 25mm, 360-ball CBGA package. Each processor requires the attachment of a heat sink to keep it within
its thermal limits.

4.7.9.2 MPC7400 L2 Cache 

The MPC7400 L2 cache for each CN is composed of pipelined, single-cycle deselect, sync burst SRAM. This is 
implemented using two 64K, 128K, or 256K by 36-bit sync burst SRAM parts to make a 0.5 MB, 1 MB, or 2 MB L2 
cache. MPC7400 L2 cache can be depopulated to 0 MB. 
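
The three cache sizes follow from the SRAM part depths if one assumes that 32 of each part's 36 bits carry data and the remaining 4 bits carry parity; that split is an assumption, not something stated here. A short check of the arithmetic:

```python
def l2_cache_bytes(depth_words, parts=2, data_bits_per_part=32):
    """Data capacity in bytes of an L2 cache built from sync burst SRAM parts.

    depth_words        -- words per part (64K, 128K or 256K)
    parts              -- number of SRAM parts (two per CN in the text)
    data_bits_per_part -- assumed data bits out of the 36-bit word
    """
    return parts * depth_words * data_bits_per_part // 8


for depth_k in (64, 128, 256):
    size = l2_cache_bytes(depth_k * 1024)
    print(depth_k, "K parts:", size / (1024 * 1024), "MB")
```

Two parts side by side give a 64-bit data path, and 2 x 64K x 4 bytes is 512 KB, which matches the 0.5 MB figure for the smallest option.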

4.7.9.3 PCE133 ASIC 

The MPC processor compute element ASIC (PCE133 ASIC) is a Mercury-designed component. It provides the 
interface between the MPC7400, the synchronous DRAM, and the RACEway++. All the PCE133 features such as 
DMA, mailbox interrupts, timers, address snooping, prefetch buffers, and so on, are available in this configuration. This 
chip is provided in a 35mm, 388-ball BGA package. Reference 3 describes the operation and registers of the PCE133 
ASIC. 

4.7.9.3.1 Register Description 

Reference 3 describes the registers of the PCE133 ASIC. 

4.7.9.4 Address Map

4.7.9.4.1 Master Address Map

Transfers from the MPC7400 to the PCE133 ASIC and RACEway++ are address mapped as shown in Table 3.
The SDRAM is 8-, 16-, 32-, or 64-bit addressable. RACEway++ locked read/write and locked read
transactions are supported for all data sizes. The 16 Mbyte boot FLASH area is further divided in Table 4.

Table 3. Master Address Map

From Address    To Address      Function
0x0000 0000     0x0FFF FFFF     Local SDRAM 256 MB
0x1000 0000     0x1FFF FFFF     XBAR 256 MB map window 1
0x2000 0000     0x2FFF FFFF     XBAR 256 MB map window 2
0x3000 0000     0x3FFF FFFF     XBAR 256 MB map window 3
0x4000 0000     0x4FFF FFFF     XBAR 256 MB map window 4
0x5000 0000     0x5FFF FFFF     XBAR 256 MB map window 5
0x6000 0000     0x6FFF FFFF     XBAR 256 MB map window 6
0x7000 0000     0x7FFF FFFF     XBAR 256 MB map window 7
0x8000 0000     0x8FFF FFFF     XBAR 256 MB map window 8
0x9000 0000     0x9FFF FFFF     XBAR 256 MB map window 9
0xA000 0000     0xAFFF FFFF     XBAR 256 MB map window A
0xB000 0000     0xBFFF FFFF     XBAR 256 MB map window B
0xC000 0000     0xCFFF FFFF     XBAR 256 MB map window C
0xD000 0000     0xDFFF FFFF     XBAR 256 MB map window D
0xE000 0000     0xEFFF FFFF     XBAR 256 MB map window E
0xF000 0000     0xFBFF FBFF     Not used (CE reg replicated mapping)
0xFBFF FC00     0xFBFF FDFF     Internal CN ASIC registers
0xFBFF FE00     0xFEFF FFFF     Prefetch control
0xFF00 0000     0xFFFF FFFF     16 MB boot FLASH memory area
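
As an illustration of how Table 3 partitions the 32-bit master address space, the following sketch classifies an address by region. It is a reading aid built only from the ranges in the table; the function name and the returned strings are hypothetical and do not correspond to any Mercury software interface.

```python
def decode_master_address(addr):
    """Classify a 32-bit MPC7400 master address using the Table 3 ranges."""
    if not 0 <= addr <= 0xFFFFFFFF:
        raise ValueError("address must fit in 32 bits")
    if addr <= 0x0FFFFFFF:
        return "Local SDRAM"
    if addr <= 0xEFFFFFFF:
        # Each XBAR window is 256 MB, so the top hex digit is the window number.
        return "XBAR map window %X" % (addr >> 28)
    if addr <= 0xFBFFFBFF:
        return "Not used (CE reg replicated mapping)"
    if addr <= 0xFBFFFDFF:
        return "Internal CN ASIC registers"
    if addr <= 0xFEFFFFFF:
        return "Prefetch control"
    return "Boot FLASH memory area"


print(decode_master_address(0xA0001234))
```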



Table 4. Boot FLASH Address Map

From Address    To Address      Function
0xFF00 2006     0xFF00 2006     Software Fail Register
0xFF00 2005     0xFF00 2005     MPC8240 HA Register
0xFF00 2004     0xFF00 2004     Node 3 HA Register
0xFF00 2003     0xFF00 2003     Node 2 HA Register
0xFF00 2002     0xFF00 2002     Node 1 HA Register
0xFF00 2001     0xFF00 2001     Node 0 HA Register
0xFF00 2000     0xFF00 2000     Local HA Register (status/control)
0xFF00 0000     0xFF00 1FFF     NovRAM



4.7.9.4.2 Slave Address Map 

Slave accesses are defined as accesses initiated by an external RACEway++ device directed toward the MPC7400 CN.
The MPC is not accessible as a slave device. The SDRAM is 8-, 16-, 32-, or 64-bit addressable. RACEway++ locked
read/write and locked read are supported for all data sizes. The PCE RACEway port supports a 256 MB address space
partitioned as shown in Table 5:

Table 5. Slave Address Map

From Address    To Address      Function
0x0000 0000     0x0FFF FBFF     256 MB less 1 KB hole SDRAM
0x0FFF FC00     0x0FFF FFFF     PCE133 internal registers



4.7.9.5 Interrupt 

Reference 3 describes the internal interrupt sources for the PCE133 ASIC. The external interrupt pin on the PCE133
ASIC is driven by the HA PLD and is currently not used. The interrupt output from the PCE133 ASIC is wired to the
CPU's external interrupt input pin.

4.7.9.6 PCE133 DIAG Bits

The DIAG3 signal is wired to the HA PLD and is used to strobe the node's hardware watchdog monitor. The DIAG2
signal is wired to the MPC8240's interrupt controller and is used by the node to generate a general purpose interrupt to
the MPC8240. The DIAGBIT signal is wired to the HA PLD and is currently not used.

4.7.9.7 MPC7400 Reset 

The MPC7400 hard reset signal is driven by three sources gated together: the HRESET_0 pin on the PCE133 ASIC,
HRESET_0 from the JTAG connector, and HRESET_0 from the majority voter. The HRESET_0 pin from the CE ASIC
is set by the "node run" bit field (bit 0) of the PCE133 ASIC's Miscon_A register. Setting HRESET_0 low causes the
MPC7400 to be held in reset. Because HRESET_0 is low immediately after system reset or power-up, the MPC7400 is
held in reset until the HRESET_0 line is pulled high by setting the node run bit to 1. The JTAG HRESET_0 is controlled
by debugger software when a JTAG debugger module is connected to the card. The HRESET_0 from the majority voter
is generated by a majority vote from all healthy nodes to reset.

4.7.9.8 Boot Procedures 

When a CPU reset is asserted, the MPC7400 is put into the reset state. The MPC7400 will remain in the reset state until
the RUN bit (bit 0) of the Miscon_A register is set to 1 and the MPC8240 has released the reset signals in the discrete
output word. The RUN bit should be set to 1 after the boot code has been loaded into the SDRAM starting at location
0x0000_0100. The ASIC maps the reset vector 0xFFF0_0100 generated by the MPC7400 to address 0x0000_0100.

4.7.9.9 MPC7400 CN SDRAM 

The main memory for each CN is composed of one bank of synchronous DRAM. This is implemented using five 
K4S280832A-TC/L75 @133 MHz synchronous DRAM parts. As shown in the memory map (See Table 3), the main 
memory begins at address 0x0 and grows upward in the address space as memory is increased. The PCE133 ASIC 
supports error correction (ECC) on the SDRAM. 

The SDRAM operates as zero wait state memory and can provide up to 1 GB/sec peak bandwidth on writes from the
MPC7400 and 800 MB/sec peak bandwidth on reads from the MPC7400.

4.7.9.10 MPC7400 Non-Volatile RAM

Each node will be equipped with 8K x 8 of non-volatile RAM for the storage of fault record data and configuration
information. This function is implemented using a SIMTEK STK12C68S45 NOVRAM attached to the PCE133 ASIC's
boot FLASH interface. The data bus of the device is isolated from the PCE ASIC through an IDT IDTQS32244SO
buffer. This buffer provides loading isolation and 3.3V to 5V translation.

4.7.10 MPC8240 Host Controller

The MPC8240 integrated processor comprises a peripheral logic block and a 32-bit embedded MPC603e PowerPC
processor core. The peripheral logic integrates a PCI bridge, memory controller, DMA controller, EPIC interrupt
controller, a message unit, and an I2C controller. The processor core is a full featured, high-performance processor with
floating-point support, memory management, 16 Kbytes instruction cache, 16 Kbytes data cache, and power management
features.



Major features of the MPC8240 are as follows:

Peripheral logic

- Memory interface
    High-bandwidth bus, 64-bit data bus, to SDRAM.
    ECC protected SDRAM.
    16 Mbytes of ROM space (32 Mbytes paged).
    8-bit ROM.
    Write buffering for PCI and processor accesses.

- PCI interface
    32-bit PCI interface operating at 33 MHz (66 MHz capable).
    PCI 2.1-compatible.
    Support for accesses to all PCI address spaces.
    Selectable big- or little-endian operation.
    Store gathering of processor-to-PCI write and PCI-to-memory write accesses.
    PCI bus arbitration unit (five request/grant pairs).

- Two-channel integrated DMA controller
    Supports direct mode or chaining mode (automatic linking of DMA transfers).
    Supports scatter gathering: read or write discontinuous memory.
    Interrupt on completed segment, chain, and error.
    Local-to-local memory.
    PCI-to-PCI memory.
    PCI-to-local memory.
    Local-to-PCI memory.

- Message unit
    Two doorbell registers.
    Inbound and outbound messaging registers.
    I2O message controller.

- I2C controller with full master/slave support

- Embedded programmable interrupt controller (EPIC)
    Five hardware interrupts (IRQs) or 16 serial interrupts.
    Four programmable timers.

- Integrated PCI bus and SDRAM clock generation

- Programmable memory and PCI bus output drivers

- Debug features
    Memory attribute and PCI attribute signals.
    Debug address signals.
    MIV signal: marks valid address and data bus cycles on the memory bus.
    Error injection/capture on data path.
    IEEE 1149.1 (JTAG)/test interface.

Processor core

- High-performance, superscalar processor core
    Integer unit (IU).
    Floating-point unit (FPU) (user enabled or disabled).
    Load/store unit (LSU).
    System register unit (SRU).
    Branch processing unit (BPU).

- 16-Kbyte instruction cache

- 16-Kbyte data cache

- Lockable L1 cache - entire cache or on a per-way basis

- Dynamic power management

4.7.10.1 Address Map 

The MPC8240 in PCI host mode supports two address mapping configurations, designated address map A and
address map B. Address map A conforms to the PowerPC reference platform (PReP) specification. Address map B
conforms to the PowerPC microprocessor common hardware reference platform (CHRP). Support for map A is
provided for backward compatibility only. It is strongly recommended that new designs use map B because map A
may not be supported in future devices.

The address space of map B is divided into four areas: system memory, PCI memory, PCI I/O, and system ROM space.
When configured for map B, the MPC8240 translates addresses across the internal peripheral logic bus and the external
PCI bus as shown in Table 6.



Table 6. MPC8240 Address Map B

Processor Core Address Range                          PCI Address Range          Definition
Hex                        Decimal
0000_0000 - 0009_FFFF      0       - 640K-1           NO PCI CYCLE               System memory
000A_0000 - 000F_FFFF      640K    - 1M-1             000A_0000 - 000F_FFFF      Compatibility hole
0010_0000 - 3FFF_FFFF      1M      - 1G-1             NO PCI CYCLE               System memory
4000_0000 - 7FFF_FFFF      1G      - 2G-1             NO PCI CYCLE               Reserved
8000_0000 - FCFF_FFFF      2G      - 4G-48M-1         8000_0000 - FCFF_FFFF      PCI memory
FD00_0000 - FDFF_FFFF      4G-48M  - 4G-32M-1         0000_0000 - 00FF_FFFF      PCI/ISA memory
FE00_0000 - FE7F_FFFF      4G-32M  - 4G-24M-1         0000_0000 - 007F_FFFF      PCI/ISA I/O
FE80_0000 - FEBF_FFFF      4G-24M  - 4G-20M-1         0080_0000 - 00BF_FFFF      PCI I/O
FEC0_0000 - FEDF_FFFF      4G-20M  - 4G-18M-1         CONFIG_ADDR                PCI configuration address
FEE0_0000 - FEEF_FFFF      4G-18M  - 4G-17M-1         CONFIG_DATA                PCI configuration data
FEF0_0000 - FEFF_FFFF      4G-17M  - 4G-16M-1         FEF0_0000 - FEFF_FFFF      PCI interrupt acknowledge
FF00_0000 - FF7F_FFFF      4G-16M  - 4G-8M-1          FF00_0000 - FF7F_FFFF      32/64-bit FLASH/ROM (1)
FF80_0000 - FFFF_FFFF      4G-8M   - 4G-1             FF80_0000 - FFFF_FFFF      8/32/64-bit FLASH/ROM (2)

(1) This bank of FLASH is not used.

(2) This bank of FLASH is configured in 8-bit mode and is further broken down in Table 7.
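
The translation in Table 6 can be illustrated with a small lookup. The sketch below covers only the system memory and PCI memory/I/O rows, and it assumes that an address is translated by carrying its offset within the processor window over to the start of the corresponding PCI range, which is what the paired ranges in the table imply. The names `MAP_B` and `to_pci` are hypothetical.

```python
# (processor start, processor end, PCI base or None for "NO PCI CYCLE", definition)
MAP_B = [
    (0x00000000, 0x0009FFFF, None,       "System memory"),
    (0x000A0000, 0x000FFFFF, 0x000A0000, "Compatibility hole"),
    (0x00100000, 0x3FFFFFFF, None,       "System memory"),
    (0x40000000, 0x7FFFFFFF, None,       "Reserved"),
    (0x80000000, 0xFCFFFFFF, 0x80000000, "PCI memory"),
    (0xFD000000, 0xFDFFFFFF, 0x00000000, "PCI/ISA memory"),
    (0xFE000000, 0xFE7FFFFF, 0x00000000, "PCI/ISA I/O"),
    (0xFE800000, 0xFEBFFFFF, 0x00800000, "PCI I/O"),
]


def to_pci(addr):
    """Return (definition, PCI address or None) for a processor core address."""
    for start, end, pci_base, name in MAP_B:
        if start <= addr <= end:
            if pci_base is None:
                return name, None
            # Assumed: the offset within the window is preserved on the PCI side.
            return name, pci_base + (addr - start)
    return "Outside the ranges modeled here", None


name, pci = to_pci(0xFD001000)
print(name, hex(pci))
```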



Table 7. Port X Address Map

Bank Select      Processor Core Address Range    Definition
11111            FFE0_0000 - FFEF_FFFF           Accesses Bank 0
11110 - 00001    FFE0_0000 - FFEF_FFFF           Application code (1) (30 pages)
00000            FFE0_0000 - FFEF_FFFF           [illegible] boot code (1), (2)
XXXX (3)         FFF0_0000 - FFFF_CFFF           Application/boot code (2)
XXXX (3)         FFFF_D000 - FFFF_D000           Discrete input word 0
XXXX (3)         FFFF_D001 - FFFF_D001           Discrete input word 1
XXXX (3)         FFFF_D002 - FFFF_D002           Discrete output word 0
XXXX (3)         FFFF_D003 - FFFF_D003           Discrete output word 1
XXXX (3)         FFFF_D004 - FFFF_D004           Discrete output word 2
XXXX (3)         FFFF_D010 - FFFF_D010           IC (Pending interrupt)
XXXX (3)         FFFF_D011 - FFFF_D011           IC (Interrupt mask low)
XXXX (3)         FFFF_D012 - FFFF_D012           IC (Interrupt clear low)
XXXX (3)         FFFF_D013 - FFFF_D013           IC (Unmasked, pending low)
XXXX (3)         FFFF_D014 - FFFF_D014           IC (Interrupt input low)
XXXX (3)         FFFF_D015 - FFFF_D015           Unused (read FF)
XXXX (3)         FFFF_D016 - FFFF_D016           Unused (read FF)
XXXX (3)         FFFF_D017 - FFFF_D017           Unused (read FF)
XXXX (3)         FFFF_D018 - FFFF_D018           Unused (read FF)
XXXX (3)         FFFF_D019 - FFFF_D019           Unused (read FF)
XXXX (3)         FFFF_D020 - FFFF_D020           HA (Local HA register)
XXXX (3)         FFFF_D021 - FFFF_D021           HA (Node 0 HA register)
XXXX (3)         FFFF_D022 - FFFF_D022           HA (Node 1 HA register)
XXXX (3)         FFFF_D023 - FFFF_D023           HA (Node 2 HA register)
XXXX (3)         FFFF_D024 - FFFF_D024           HA (Node 3 HA register)
XXXX (3)         FFFF_D025 - FFFF_D025           HA (8240 HA register)
XXXX (3)         FFFF_D026 - FFFF_D026           HA (Software Fail)
XXXX (3)         FFFF_D027 - FFFF_D027           HA (Watchdog Strobe)
XXXX (3)         FFFF_D028 - FFFF_DFFF           4068 Bytes FLASH
XXXX (3)         FFFF_E000 - FFFF_FFFF           8K NOVRAM

(1) Thirty-one 1 Mbyte blocks of application memory residing at address FFE0_0000 - FFEF_FFFF selected by the
FLASH page bits.

(2) 2 Mbyte block available after reset.

(3) Always available.



4.7.10.2 Register Description 

Reference 10 describes the registers of the MPC8240. 



4.7.10.3 Interrupt 

The MPC8240 contains an embedded programmable interrupt controller (EPIC) device. The EPIC implements the
necessary functions to provide a flexible and general-purpose interrupt controller solution. The EPIC pools
hardware-generated interrupts from many sources, both within the MPC8240 and externally, and delivers them to the
processor core in a prioritized manner. The solution adopts the OpenPIC architecture (architecture developed jointly by
AMD and Cyrix for SMP interrupt solutions) and implements the logic and programming structures according to that
specification. The MPC8240's EPIC unit supports up to five external interrupts, four internal logic-driven interrupts and
four timers with interrupts. See Reference 10 for a detailed description of the EPIC unit.

The five external interrupt inputs to the EPIC are wired to the external interrupt controller PLD. 

4.7.10.4 MPC8240 Reset 

The MPC8240 can be reset from three sources: a board level reset (RESET_0), a JTAG controlled reset, or a failure in
its watchdog monitor. Any reset to the MPC8240 shall cause the discrete output registers to reset to the low state; this,
in turn, will cause all G4 nodes to enter the reset state.

4.7.10.5 Boot Procedure 

After the release of reset to the MPC8240, it will begin executing code out of the FLASH memory. A reset will
automatically set the FLASHSEL(4:0) bits to all zeros; therefore, the MPC8240's boot code must reside in bank 0.
Once its application code is copied to SDRAM, the MPC8240 can then sequence through the FLASH banks by setting
the appropriate bits in the discrete output word. Application code for the G4 nodes resides in the remaining thirty-one
banks of FLASH.



4.7.11 Bulk FLASH Memory

There are 32 Mbytes of bulk FLASH memory, comprised of two Intel 28F128J3 StrataFLASH memory devices. The
MPC8240's memory map limits the size of the 8-bit wide FLASH to 2 Mbytes; this requires hardware to divide the
FLASH into thirty-two 1 Mbyte banks. Five software-controlled discretes allow switching between banks. Consistent
with Table 7, accesses to the 1 Mbyte address range of FFF0_0000 through FFFF_FFFF will always access the first
block of FLASH, NOVRAM, discrete I/O, HA registers, watchdog monitor, and the interrupt controller. Accesses to the
1 Mbyte address range of FFE0_0000 through FFEF_FFFF will access a page of memory in the FLASH. The actual page
selected is based on the five FLASH select bits, driven by the discrete output word.



4.7.12 Real Time Clock

The PCF8563 is a CMOS real-time clock/calendar optimized for low power consumption. A programmable clock
output, interrupt output and voltage-low detector are also provided. All addresses and data are transferred serially via a
two-line bidirectional I2C-bus. Maximum bus speed is 400 kbits/s.

Real Time Clock Features: 

- Provides year, month, day, weekday, hours, minutes and seconds
  (based on an external 32.768 kHz quartz crystal)

- Century flag

- Wide operating supply voltage range: 1.0 to 5.5 V

- Low back-up current; typical 0.25 µA at VDD = 3.0 V and Tamb = 25 °C

- 400 kHz two-wire I2C-bus interface (at VDD = 1.8 to 5.5 V)

- Programmable clock output for peripheral devices: 32.768 kHz, 1024 Hz, 32 Hz and 1 Hz

- Alarm and timer functions

- Voltage-low detector

- Integrated oscillator capacitor

- Internal power-on reset

- I2C-bus slave address: read A3H; write A2H

- Open drain interrupt pin
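
The read and write slave addresses listed above are the two 8-bit forms of a single 7-bit I2C device address, with the least significant bit carrying the transfer direction. The short sketch below shows the relationship; the function name is illustrative.

```python
def split_i2c_address_byte(address_byte):
    """Split an 8-bit I2C address byte into (7-bit address, direction)."""
    seven_bit = address_byte >> 1
    direction = "read" if address_byte & 1 else "write"
    return seven_bit, direction


for byte in (0xA2, 0xA3):
    addr, direction = split_i2c_address_byte(byte)
    print(hex(byte), "->", hex(addr), direction)
```

Both bytes resolve to the same 7-bit address, 0x51, which is the form most I2C host drivers expect.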

4.7.13 Nonvolatile Memory 

The MPC8240 will be equipped with 8K x 8 of non-volatile RAM for the storage of fault record data and configuration
information. This function is implemented using a SIMTEK STK12C68S45 NOVRAM attached to the local bus
interface. The device's data bus is isolated from the local bus through an IDT IDTQS32244SO buffer. This buffer
provides 3.3V to 5V translation.

4.7.14 Fault Status and Control Registers 

The MPC8240 has access to five 8-bit status registers. One register represents its own status, while the others represent
the fault status of the four G4 CPUs. Each register has the identical format, shown in Table 8. These five registers give
the MPC8240 status information from each node on the board without going through the RACEway fabric.

The MPC8240 will have one 8-bit fault control register. The control register for each CPU will have the format shown
in Table 9.



Bit   Name                   Description
0     CHECKSTOP_OUT          Checkstop state of CPU (0 = CPU in checkstop)
1     WDM_FAULT              WDM failed (0 = WDM failed, set high after reset and valid service)
2     SOFTWARE_FAULT         Software fault detected (set to 0 when a software exception was detected) (R/W local)
3     RESETREQ_IN            Wrap status of the local CPU's reset request
4     WDM_INIT               WDM failed in initial 2 second window (0 = WDM failed)
5     Software definable 0   Software definable 0
6     Software definable 1   Software definable 1
7     unused                 unused

Table 8. Fault Status Register Format
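
A fault status register in the Table 8 format could be decoded as sketched below. The fault flags are active low, as the table states. The specification does not say whether bit 0 is the least or most significant bit of the byte, so the sketch takes that as a parameter; both the default and the names used here are assumptions for illustration.

```python
# Active-low fault flags from Table 8. RESETREQ_IN is a wrap status, not a
# fault, so it is left out of this list.
FAULT_STATUS_BITS = {
    0: "CHECKSTOP_OUT",
    1: "WDM_FAULT",
    2: "SOFTWARE_FAULT",
    4: "WDM_INIT",
}


def active_faults(value, msb0=False):
    """List the fault flags that read 0 (asserted) in an 8-bit status value.

    msb0=False treats bit 0 as the least significant bit (assumed default);
    msb0=True treats bit 0 as the most significant bit (PowerPC numbering).
    """
    faults = []
    for bit, name in FAULT_STATUS_BITS.items():
        shift = 7 - bit if msb0 else bit
        if not (value >> shift) & 1:
            faults.append(name)
    return faults


print(active_faults(0xFD))
```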



Bit   Name                   Description
0     RESETREQ_OUT_0         Request a reset event (0 => forces reset)
1     CHKSTOPOUT_0           Request that node 0 enter checkstop state (0 => request checkstop)
2     CHKSTOPOUT_1           Request that node 1 enter checkstop state (0 => request checkstop)
3     CHKSTOPOUT_2           Request that node 2 enter checkstop state (0 => request checkstop)
4     CHKSTOPOUT_3           Request that node 3 enter checkstop state (0 => request checkstop)
5     CHKSTOPOUT_8240        Request that the MPC8240 enter checkstop state (0 => request checkstop)
6     Software definable 0   Software definable 0
7     Software definable 1   Software definable 1

Table 9. Fault Control Register Definition



4.7.15 Majority Voter 

There are two different functions controlled by majority voters. The first is local to each CPU; this voter controls the
assertion of CHECKSTOP_IN to the CPU. The second voter is centralized to the board; it controls the master reset
to the board. Both voters shall follow the same set of rules: the output will follow the majority of non-checkstopped
CPUs. A 1-on-1 or 2-on-2 condition in either voter will result in a board level reset.
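
The voting rule above can be sketched in a few lines. This is a behavioral model only, not the PLD logic: the function name, the boolean inputs, and the returned strings are hypothetical, and the behavior when no CPU is eligible to vote is not defined by the text (the sketch simply returns "idle" in that case).

```python
def vote(requests, checkstopped):
    """Model the majority voter rule.

    requests:     one bool per CPU, True if that CPU requests the action
    checkstopped: one bool per CPU, True if that CPU is in checkstop
    Returns "assert", "idle", or "board_reset" (on a tied vote).
    """
    # Only CPUs that are not checkstopped take part in the vote.
    votes = [req for req, stopped in zip(requests, checkstopped) if not stopped]
    yes = sum(votes)
    no = len(votes) - yes
    if votes and yes == no:
        # 1-on-1 or 2-on-2: no majority, so the board is reset.
        return "board_reset"
    return "assert" if yes > no else "idle"


print(vote([True, True, True, False], [False] * 4))
```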
4.7.16 Discrete I/O

There are 16 discrete output signals directly controllable and readable by the MPC8240. The 16 discretes are divided up
into two addressable 8-bit words. Writing to a discrete output register will cause the upper 8 bits of the data bus to be
written to the discrete output latch. Reading a discrete output register will drive the 8-bit discrete output onto the upper
8 bits of the MPC8240's data bus. Table 10 defines the bits in the discrete output word.

There are 16 discrete input signals accessible by the MPC8240. Reads from the discrete input address space will latch
the state of the signals and return the latched state of the discretes to the MPC8240. Table 11 defines the bits in the
discrete input word.



Table 10. Discrete Output Words

Word 2

DH(0:7)   Signal            Description
0         ND0_FLASH_EN_1    Enable the CE ASIC's FLASH port when 1
1         ND1_FLASH_EN_1    Enable the CE ASIC's FLASH port when 1
2         ND2_FLASH_EN_1    Enable the CE ASIC's FLASH port when 1
3         ND3_FLASH_EN_1    Enable the CE ASIC's FLASH port when 1
4         WRAP1             Wrap to discrete input
5
6
7

Word 1

DH(0:7)   Signal            Description
0         WRAP0             Wrap to discrete input
1         I2C_RESET_0       Reset the I2C serial bus when 0
2         SWLED             Software controlled LED
3         FLASHSEL4         Flash bank select address bit 4
4         FLASHSEL3         Flash bank select address bit 3
5         FLASHSEL2         Flash bank select address bit 2
6         FLASHSEL1         Flash bank select address bit 1
7         FLASHSEL0         Flash bank select address bit 0



Word 0

DH(0:7)   Signal             Description
0         C  SRESET3  0      Issue a Soft Reset to CPU on Node 3 when 0
1         C  PRESET3  0      Reset PCE133 ASIC Node 3 when 0
2         C  SRESET2  0      Issue a Soft Reset to CPU on Node 2 when 0
3         C  PRESET2  0      Reset PCE133 ASIC Node 2 when 0
4         C  SRESET1  0      Issue a Soft Reset to CPU on Node 1 when 0
5         C  PRESET1  0      Reset PCE133 ASIC Node 1 when 0
6         C  SRESET0  0      Issue a Soft Reset to CPU on Node 0 when 0
7         C  PRESET0  0      Reset PCE133 ASIC Node 0 when 0
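
As an illustration of how software might select a FLASH bank through discrete output word 1, the sketch below updates only the FLASHSEL(4:0) field of the byte and preserves WRAP0, I2C_RESET_0, and SWLED. It assumes DH0 is the most significant bit of the byte (the usual PowerPC bit numbering), which places FLASHSEL4 through FLASHSEL0 in the five least significant bits. That bit ordering and the function name are assumptions, not statements from this specification.

```python
FLASHSEL_MASK = 0x1F  # DH3..DH7, assuming DH0 is the most significant bit


def set_flash_bank(word1, bank):
    """Return discrete output word 1 with FLASHSEL(4:0) set to `bank`.

    word1: current 8-bit value of discrete output word 1
    bank:  FLASH bank select value, 0..31
    """
    if not 0 <= word1 <= 0xFF:
        raise ValueError("word1 must be an 8-bit value")
    if not 0 <= bank <= 31:
        raise ValueError("bank must be in the range 0..31")
    # Clear the five select bits, keep the upper three, then insert the bank.
    return (word1 & ~FLASHSEL_MASK & 0xFF) | bank


print(hex(set_flash_bank(0xE0, 5)))
```

A read-modify-write of this kind avoids disturbing the I2C reset and LED bits that share the same word.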



Table 11. Discrete Input Words

Word 1

DH(0:7)   Signal             Description
0         WRAP1              Wrap from discrete output word
1         TBD
2         V3.3_FAIL_0        Latched status of power supply since last reset
3         V2.5_FAIL_0        Latched status of power supply since last reset
4         VCORE1_FAIL_0      Latched status of power supply since last reset
5         VCORE0_FAIL_0      Latched status of power supply since last reset
6         RIOR_CNF_DONE_1    RIO/RACE++ FPGA configuration complete
7         PXB0_CNF_DONE_1    PXB++ FPGA configuration complete



MCW-1 a Functional Specification Created on 2/2/01 


Mercury Computer Systems. Inc. COMPANY CONFIDENTIAL 







Word 0

DH(0:7)   Signal             Description
0         WRAP0              Wrap from discrete output word
1         WDMSTATUS          MPC8240's watchdog monitor status (0 = failed)
2, 3      NPORESET_1         Not a power on reset when high
4, 5
6, 7

4.7.17 Interrupt Controller 

The MPC8240 interfaces with an 8-input interrupt controller external to the MPC8240 itself. The interrupt inputs are
wired, through the controller, to interrupt zero of the MPC8240 external interrupt inputs. The remaining four MPC8240
interrupt inputs are unused.

The Interrupt Controller comprises the following five 8-bit registers:

Pending Register - A low bit indicates a falling edge was detected on that interrupt (read only)

Clear Register - Setting a bit low will clear the corresponding latched interrupt (write only)

Mask Register - Setting a bit low will mask the pending interrupt from generating an MPC8240 interrupt

Unmasked Pending Register - A low bit indicates a pending interrupt that is not masked out

Interrupt State Register - Indicates the actual logic level of each interrupt input pin



4.7.17.1 Interrupt Controller Operation 

Table 12 lists the interrupt input sources and their bit positions within each of the registers. A falling edge on an
interrupt input will set the appropriate bit in the pending register low. The pending register is gated with the mask
register, and any unmasked pending interrupts will activate the interrupt output signal to the MPC8240's external
interrupt input pin. Software will then read the unmasked pending register to determine which interrupt(s) caused the
exception. Software can then clear the interrupt(s) by writing a zero to the corresponding bit in the clear register. If
multiple interrupts are pending, the software has the option of either servicing all pending interrupts at once and then
clearing the pending register, or servicing the highest priority interrupt (software priority scheme) and then clearing that
single interrupt. If more interrupts are still latched, the interrupt controller will generate a second interrupt to the
MPC8240 for software to service. This will continue until all interrupts have been serviced. An interrupt that is masked
will show up in the pending register but not in the unmasked pending register and will not generate an MPC8240
interrupt. If the mask is then cleared, that pending interrupt will flow through the unmasked pending register and
generate an MPC8240 interrupt.
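
The register behavior described above can be modeled in software as follows. This is a behavioral sketch of the active-low pending, mask, and clear registers, not the PLD implementation. The class and method names are hypothetical, bit indices are taken as least-significant-bit-first, and the power-up register values (nothing pending, nothing masked) are assumptions.

```python
class InterruptControllerModel:
    """Behavioral model of the active-low interrupt controller registers."""

    def __init__(self):
        self.pending = 0xFF  # all bits high: no interrupt latched (assumed)
        self.mask = 0xFF     # all bits high: no interrupt masked (assumed)

    def falling_edge(self, bit):
        """Latch an interrupt: drive its pending bit low."""
        self.pending &= ~(1 << bit) & 0xFF

    def write_clear(self, value):
        """Writing a 0 to a bit clears (sets high) that pending bit."""
        self.pending |= ~value & 0xFF

    @property
    def unmasked_pending(self):
        """Low only where an interrupt is pending (0) and not masked (mask 1)."""
        return (self.pending | (~self.mask & 0xFF)) & 0xFF

    @property
    def interrupt_asserted(self):
        """True while any unmasked interrupt is still latched."""
        return self.unmasked_pending != 0xFF


ic = InterruptControllerModel()
ic.falling_edge(1)              # for example, a real time clock event
print(ic.interrupt_asserted)    # an unmasked interrupt is now latched
ic.write_clear(0xFD)            # write 0 to bit 1 of the clear register
print(ic.interrupt_asserted)
```

The model also reproduces the masked case from the text: a masked interrupt stays visible in `pending` but not in `unmasked_pending`, and it raises the output as soon as the mask bit is set high again.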



Table 12. Interrupt Controller Inputs

Bit   Signal           Description
0     SWFAIL_0         8240 Software Controlled Fail Discrete
1     RTC_INT_0        Real time clock event
2     NODE0_FAIL_0     WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
3     NODE1_FAIL_0     WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
4     NODE2_FAIL_0     WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
5     NODE3_FAIL_0     WDFAIL_0 or IWDFAIL_0 or SWFAIL_0 active
6     PCI_INT_0        PCI interrupt
7     XB_SYS_ERR_0     XBAR internal error

4.7.18 Configuration Jumpers

J18-1 - J18-2, the watchdog monitor mask, when installed, will mask all watchdog failures.

J18-3 - J18-4, the serial EEPROM's write enable jumper, when installed, enables modification of the serial EEPROMs.
J18-5 - J18-6, the flash write-protect jumper, when installed, prevents modification of any flash memory location.
J18-7 - J18-8, the PXB0 use PROM jumper, when installed, will enable the PXB0's serial configuration PROM.

4.7.19 LEDs 

There are nine LEDs, visible at the top of the board.

LD1 is a software controlled LED
LD2 is a software controlled LED
LD3 is the Node 0 watchdog fail LED
LD4 is the Node 1 watchdog fail LED
LD5 is the Node 2 watchdog fail LED
LD6 is the Node 3 watchdog fail LED
LD7 is the MPC8240 watchdog fail LED
LD8 indicates the state of the board level reset
LD9 indicates a XBAR system error

There are an additional two LEDs on the Ethernet connector for Ethernet status.

4.7.20 Power Supply

The MCW-1a board requires 3.3V, 2.5V, and 1.8V. There are two 1.8V supplies; each drives the core voltage for two
CPUs. To provide power to the MCW-1a, the three voltages must have separate switching supplies, and proper power
sequencing to the device must be provided. All three voltages are converted from 5.0V. The power to the daughter card
is provided directly from the modem board.

4.7.20.1 MPC7400 Core Power Supply

There are two core voltage power supplies; each one is dedicated to two MPC7400 PPC cores. The core voltage can be
in the 1.5V to 2.2V range. This power supply is rated at 12A across that range.

4.7.20.2 Main 3.3V Power Supply 

A 3.3V power supply is used to provide power to the SBSRAM core, SDRAM, SCSI, PXB++, and XBAR++ PCE133 
I/O. This power supply is rated at TBD Amp. 

4.7.20.3 Core and I/O 2.5V Power Supply 

A 2.5V power supply is used to provide power to the PCE133 and can also power the PXB++ FPGA core. The 
MPC7400 processor bus can run at 2.5V signaling. The MPC7400 L2 bus can operate at 2.5V signaling. This 2.5V 
power supply is rated at TBD Amp. 

4.7.20.4 ASICs Power Supplies Tolerance Requirements 

SBSRAM VDD = 3.3V +0.165V/-0.165V power supply
SBSRAM VDDQ = 3.3V +0.165V/-0.165V for 3.3V I/O or 2.5V +0.4V/-0.125V for 2.5V I/O
SDRAM VDD = 3.3V +0.3V/-0.3V power supply
XBAR++ VDD = 3.3V +0.3V/-0.3V power supply
PCE133 VDD = 2.5V +?V/-?V power supply
PCE133 VDD33 = 3.3V +?V/-?V power supply

4.7.20.5 Power Supply Voltage Sequencing

Power sequencing is very important in multivoltage digital boards; it is necessary for long-term reliability. Correct
power supply sequencing can be accomplished by using power_good and inhibit signals. To provide fail-safe operation
of the device, power should be supplied so that if the core supply fails during operation, the I/O supply is shut down as
well.

The general rule is to ramp all power supplies up and down at the same time. This is shown in Figure 5. In reality, ramp
up and ramp down depend on multiple factors: the power supply, the total board capacitance that needs to be charged,
the power supply load, and so on. Figure 6 shows ideal worst-case sequencing for ramp up and ramp down as performed
by the protection sequencing circuits shown in Figure 7. This circuit keeps the voltage difference within the required
range. The MPC7400 requires that the core supply not exceed the I/O supply by more than 0.4 volts at any time. Also,
the I/O supply must not exceed the core supply by more than 2 volts.
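
The two MPC7400 supply constraints stated above can be written as a simple check, for example when reviewing logged ramp data. The function name and the idea of checking sampled voltages are illustrative; only the 0.4V and 2V limits come from the text.

```python
def supplies_within_limits(v_core, v_io):
    """Check the MPC7400 core/I/O supply relationship at one instant.

    The core supply may exceed the I/O supply by at most 0.4 V, and the
    I/O supply may exceed the core supply by at most 2.0 V.
    """
    return (v_core - v_io) <= 0.4 and (v_io - v_core) <= 2.0


# Steady state with a 1.8 V core and 3.3 V I/O is within limits;
# 3.3 V I/O with the core still at 0 V is not.
print(supplies_within_limits(1.8, 3.3), supplies_within_limits(0.0, 3.3))
```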
Figure 7. VOLTAGE SEQUENCING CIRCUITS

A voltage of 0.7V drops across one diode.

During power up sequencing:

D1 and D2 provide the ramp up voltage for the 2.5V power supply as soon as the 3.3V power supply reaches 1.4V.
D3 and D4 provide the ramp up voltage for the 1.8V_1 power supply as soon as the 2.5V power supply reaches 1.4V.
D7 and D8 provide the ramp up voltage for the 1.8V_2 power supply as soon as the 2.5V power supply reaches 1.4V.

During power down sequencing:

D5 provides the ramp down for the 2.5V power supply as soon as the 3.3V power supply reaches 1.8V.
D6 provides the ramp down for the 1.8V_1 power supply as soon as the 2.5V power supply reaches 1.1V.
D9 provides the ramp down for the 1.8V_2 power supply as soon as the 2.5V power supply reaches 1.1V.

The 3.3V power supply is connected to the VCC3P3 power plane.
The 2.5V power supply is connected to the VCC2P5 power plane.
The 1.8V_1 power supply is connected to the VCC1P8_1 power plane.
The 1.8V_2 power supply is connected to the VCC1P8_2 power plane.

4.7.20.6 Power Supply Monitoring 

A PLD is used to monitor the voltage status signals from the onboard supplies. It is powered up from +5V and monitors
+3.3V, +2.5V, +1.8V_1 and +1.8V_2. This circuit monitors the power_good signals from each supply. In the case of a
power failure in one or more supplies, the PLD will issue a restart to all supplies and a board level reset to the daughter
card. A latched power status signal will be available from each supply as part of the discrete input word. The latched
discrete shall indicate any power fault condition since the last off-board reset condition.

5 ELECTRICAL INTERFACE

5.1.1 Power Consumption

Table 13. MCW-1a CN Power Consumption

Description   Qty   Total Typ. Power   Total Max. Pwr.
CE ASIC       1     1W                 1.5W
SDRAM         5     3W                 3.5W
SBSRAM        2     1.2W               2.5W
G4            1     8W                 12W
Oscillator    1     0.1W               0.1W
PLD           1     0.15W              0.2W










Table 14. MCW-1a Power Consumption

5.1.2.1 Over-the-Top RACEway++ Interlink 

See Appendix A for the over-the-top RACEway++ interlink connector pinout. 



5.1.2.2 PCI 32-Bit Modem Connector

See Appendix B for the PCI 32-bit modem connector pinout. 



5.1.2.3 Ethernet 10/100BT 

See Appendix C for the Ethernet 10/100 BT 



5.1.2.4 PPC Debugger

See Appendix D for the PPC Debugger connector pinout.

6 MECHANICAL

6.1.1 Packaging

The MCW-1 is a dual-side PCB assembly. The board is designed to be used in a custom system. The MCW-1 PCB
is TBD thick and TBD layers.

6.1.2 Physical Constraint 

The PCB board must comply with the Motorola daughter card form factor. 

7 ENVIRONMENTAL 

7.1.1 Temperature & Air Flow 

Operating temperature: TBD 
Storage temperature: TBD 

7.1.2 Humidity 
TBD 



7.1.3 Operating Altitude 

TBD 



7.1.4 Shock & Vibration 

TBD 



7.1.5 Compliance 

TBD 



7.1.6 Reliability 

TBD 



8 SWITCHES & JUMPERS 



8.1 J22 Jumper

Provisional Hotswap s

J22 Ref. Des.    Jumper Function
1-2              PXB0_HS_HNDL_SW high
2-3              PXB0_HS_HNDL_SW low



8.2 J11 Jumper

Raceway clock master selection

J11 Ref. Des.    Jumper Function
1-2 (open)       MCW-1A Master
1-2 (shorted)    MCW-1A Slave

8.3 J10 Jumper

F1 Raceway XBREQI - XBREQO

J10 Ref. Des.    Jumper Function
3-4, 5-6         Straight through
1-2, 7-8         Crossover

8.4 J4 Jumper

F2 Raceway XBREQI - XBREQO crossover.

J4 Ref. Des.     Jumper Function
3-4, 5-6         Straight through
1-2, 7-8         Crossover

8.5 J3 Jumper

F2 Raceway CBL_CLK_O - CBL_CLK_I crossover.

J3 Ref. Des.     Jumper Function
3-4, 5-6         Straight through
1-2, 7-8         Crossover

8.6 J9 Jumper

F1 Raceway CBL_CLK_O - CBL_CLK_I crossover.

J9 Ref. Des.     Jumper Function
3-4, 5-6         Straight through
1-2, 7-8         Crossover



8.7 J18 Jumper

Miscellaneous control

J18 Ref. Des.    Jumper Function
1-2              WDM fail disable
3-4              Serial PROM write enable
5-6              FLASH write enable
7-8              PXBO use configuration PROM
9-10             Unused



8.8 J21 Jumper

Master clock source selector

J21 Ref. Des.    Jumper Function
1-2              F1 cable port master
3-4              F2 cable port master
Both closed      MCW-1A master
Both open        MCW-1A master



9 TESTABILITY 
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9.1 JTAG Test Scan 

The MPC7400, MPC8240, PCI-PCI bridge, PCE133 ASIC, PXB++ ASIC, XBAR++ ASIC, and the Ethernet
controller provide support for the IEEE Standard 1149.1 test port (JTAG). Refer to the individual component
specifications to obtain their JTAG test access port (TAP) descriptions.

The MCW-1a board contains several JTAG scan chains. They provide access to the JTAG test port on the
MPC7400s, MPC8240, L2 caches, XBAR++, PCE133s, Ethernet, PCI-PCI bridge, and the PXB devices. The
scan chains are defined as:
Chain 1 -> MPC7400_1
Chain 2 -> MPC7400_2
Chain 3 -> MPC7400_3
Chain 4 -> MPC7400_4
Chain 5 -> MPC8240

Chain 6 -> RESET_PLD, PCEFIX1_PLD, NODE0_HA_PLD, NODE1_HA_PLD, PCEFIX2_PLD,
NODE2_HA_PLD, NODE3_HA_PLD, 8240_DECODE_PLD, VOTER_SYNC_PLD, 8240_HA_PLD,
PXB_PROM, L2 Cache_1, PCE133_1, L2 Cache_2, PCE133_2, XBAR, L2 Cache_3, PCE133_3, L2
Cache_4, PCE133_4, PXB++, PCI-PCI Bridge, Ethernet



The scan path is accessible via connector J16. The enable for the scan chain buffer is controlled by jumper
J20.

The RACEway++ interlink external connectors will be tested with external loop-back connectors.

Note: Both the RACEway++ clock (66 MHz) and the PCI clock (33 MHz) must be running to allow the scan path in
the PXB to function properly.

10 RACEway++ Over-the-Top Connector Pinout

Table 15. RACEway++ F1 Cable Mode Connector Pinout J-1



Pin    Signal       Pin    Signal
A1     GND          B1     CLK_X_JX1_IO
A2     GND          B2     JX1_CBL_CLK_IO
A3     GND          B3     JX1_XBREQ_I
A4     GND          B4     JX1_XBREQ_O
A5     GND          B5     JX1_XBSTROBIO
A6     GND          B6     JX1_XBRPLYIO
A7     GND          B7     JX1_XBRDCONIO
A8     GND          B8     JX1_XBIO00
A9     GND          B9     JX1_XBIO01
A10    GND          B10    JX1_XBIO02
A11    GND          B11    JX1_XBIO03
A12    GND          B12    JX1_XBIO04
A13    GND          B13    JX1_XBIO05
A14    GND          B14    JX1_XBIO06
A15    GND          B15    JX1_XBIO07
A16    GND          B16    JX1_XBIO08
A17    GND          B17    JX1_XBIO09
A18    GND          B18    JX1_XBIO10
A19    GND          B19    JX1_XBIO11
A20    GND          B20    JX1_XBIO12
A21    GND          B21    JX1_XBIO13
A22    GND          B22    JX1_XBIO14
A23    GND          B23    JX1_XBIO15
A24    GND          B24    JX1_XBIO16
A25    GND          B25    JX1_XBIO17
A26    GND          B26    JX1_XBIO18
A27    GND          B27    JX1_XBIO19
A28    GND          B28    JX1_XBIO20
A29    GND          B29    JX1_XBIO21
A30    GND          B30    JX1_XBIO22
A31    GND          B31    JX1_XBIO23
A32    GND          B32    JX1_XBIO24
A33    GND          B33    JX1_XBIO25
A34    GND          B34    JX1_XBIO26
A35    GND          B35    JX1_XBIO27
A36    GND          B36    JX1_XBIO28
A37    GND          B37    JX1_XBIO29
A38    GND          B38    JX1_XBIO30
A39    JX1_XBPAR    B39    JX1_XBIO31
A40    +3.3V        B40    R_RST_JX



Table 16. RACEway++ F2 Cable Mode Connector Pinout J-2

Pin    Signal       Pin    Signal
A1     GND          B1     CLK_X_JX2_IO
A2     GND          B2     JX2_CBL_CLK_IO
A3     GND          B3     JX2_XBREQ_I
A4     GND          B4     JX2_XBREQ_O
A5     GND          B5     JX2_XBSTROBIO
A6     GND          B6     JX2_XBRPLYIO
A7     GND          B7     JX2_XBRDCONIO
A8     GND          B8     JX2_XBIO00
A9     GND          B9     JX2_XBIO01
A10    GND          B10    JX2_XBIO02
A11    GND          B11    JX2_XBIO03
A12    GND          B12    JX2_XBIO04
A13    GND          B13    JX2_XBIO05
A14    GND          B14    JX2_XBIO06
A15    GND          B15    JX2_XBIO07
A16    GND          B16    JX2_XBIO08
A17    GND          B17    JX2_XBIO09
A18    GND          B18    JX2_XBIO10
A19    GND          B19    JX2_XBIO11
A20    GND          B20    JX2_XBIO12
A21    GND          B21    JX2_XBIO13
A22    GND          B22    JX2_XBIO14
A23    GND          B23    JX2_XBIO15
A24    GND          B24    JX2_XBIO16
A25    GND          B25    JX2_XBIO17
A26    GND          B26    JX2_XBIO18
A27    GND          B27    JX2_XBIO19
A28    GND          B28    JX2_XBIO20
A29    GND          B29    JX2_XBIO21
A30    GND          B30    JX2_XBIO22
A31    GND          B31    JX2_XBIO23
A32    GND          B32    JX2_XBIO24
A33    GND          B33    JX2_XBIO25
A34    GND          B34    JX2_XBIO26
A35    GND          B35    JX2_XBIO27
A36    GND          B36    JX2_XBIO28
A37    GND          B37    JX2_XBIO29
A38    GND          B38    JX2_XBIO30
A39    JX2_XBPAR    B39    JX2_XBIO31
A40    +3.3V        B40    R_RST_JX
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1 1 Modem Board Connector Pinout 

Table 17. Modem Board Connector Pin Assignments 



J29

Pin    Signal         Signal          Pin
1      5V             PMC_AD0         2
3      5V             PMC_AD1         4
5      5V             PMC_AD2         6
7      5V             PMC_AD3         8
9      PCI_RST_0      PMC_AD4         10
11     GND            PMC_AD5         12
13     GND            PMC_AD6         14
15     PMC_IDSEL_1    PMC_AD7         16
17     5V             PMC_AD8         18
19     5V             PMC_AD9         20
21     PMC_TRDY_0     PMC_AD10        22
23     GND            PMC_AD11        24
25     GND            PMC_AD12        26
27     PMC_STOP_0     PMC_AD13        28
29     5V             PMC_AD14        30
31     5V             PMC_AD15        32
33     PMC_PERR_0     PMC_AD16        34
35     GND            PMC_AD17        36
37     GND            PMC_AD18        38
39     PMC_SERR_0     PMC_AD19        40
41     5V             PMC_AD20        42
43     5V             PMC_AD21        44
45     CLK_PMC        PMC_AD22        46
47     GND            PMC_AD23        48
49     GND            PMC_AD24        50
51     PMC_C_BE0      PMC_AD25        52
53     PMC_C_BE1      PMC_AD26        54
55     5V             PMC_AD27        56
57     5V             PMC_AD28        58
59     PMC_C_BE2      PMC_AD29        60
61     PMC_C_BE3      PMC_AD30        62
63     GND            PMC_AD31        64
65     GND            5V              66
67     GND            PMC_FRAME_0     68
69     PMC_INTA_0     GND             70
71     GND            PMC_IRDY_0      72
73     GND            5V              74
75     PMC_GNT_0      PMC_DEVSEL_0    76
77     5V             PMC_LOCK_0      78
79     PMC_REQ_0      PMC_PAR         80



MCW-1a Functional Specification

Created on 2/2/01

Mercury Computer Systems, Inc. COMPANY CONFIDENTIAL

12 Processor JTAG Connector Pinout 

The JTAG connectors are unique to each processor. Table 18 shows the generic signal names on each connector pin; the
actual names will have each processor's extension appended to the generic signal name.

Table 18. JTAG Jx Connectors Pin Assignments



Jx-    SIGNAL         Jx-    SIGNAL
1      TDO            2      QACKN
3      TDI            4      TRSTN
5      HALTEDN        6      3.3V
7      TCK            8      CKSTOP_INN
9      TMS            10     N.C.
11     SRESETN        12     N.C.
13     HRESETN        14     «key»
15     CKSTOP_OUTN    16     GND
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13 Non-Processor JTAG Connector Pinout 

The non-processor JTAG connector ties together all the remaining JTAG-capable devices. Table 19 shows the
signal names on each connector pin. The connector is designed to include only the programmable PLDs and PROM
when the program cable is installed, or the entire chain when the boundary scan test connector is installed.

Table 19. JTAG J16 Connectors Pin Assignments



J16-    Signal         Description
1       TMS_JTAG       JTAG Test Mode Select
2       TDI_JTAG       JTAG Test Data In
3       TDO_JTAG       Boundary Scan Test Data Out
4       TESTN          Driven low when connector inserted
5       TCK_JTAG       JTAG Test Clock
6       GND            Ground on module
7       PXB_CNF_TDO    TDO from end of PLD chain
8       TDI_ND0        TDI into non-PLD Chain
9       +5V            +5V Power on Module
10      TEST           Driven high when connector inserted



[Figure 8 diagram not recoverable from the scan. It shows two cable configurations on connector J16:

PLD Program Configuration (PLD, PLD, PROM chain): J16-1 TMS_JTAG, J16-2 TDI_JTAG, J16-5 TCK_JTAG, J16-7 PXB_CNF_TDO, J16-9 Power, TESTN, GND.

Boundary Scan Test Configuration: J16-1 TMS_JTAG, J16-2 TDI_JTAG, J16-5 TCK_JTAG, J16-7 PXB_CNF_TDO, J16-8 TDI_ND0, J16-3 TDO_JTAG, J16-9 Power, TESTN, GND.]

Figure 8. JTAG Connector configuration options 
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14 Design Notes 

14.1 MPC7400 and Nitro Bus

                 1.8V    2.5V    3.3V
MPC7400 V 60x    Yes     Yes     Yes
MPC7400 V L2     Yes     Yes     Yes
Nitro V 60x      Yes     Yes     No
Nitro V L2       Yes     Yes     No
PCE133 V 60x     No      Yes     No
SBSRAM Vi/o      No      Yes     Yes



14.2 Bypass Capacitors Selection 

(Based on App. Note from Micron TN-00-06)

Vcore = 3.3V +/- 0.165V, which is 5%
Vi/o = 2.5V +/- 0.125V, which is 5%

When the SBSRAMs are driving a 30pF load from 0V to 2.5V with 1ns edges, the transient current is:

I = (C * dV)/dt = (30pF * 2.5V)/1ns = 75mA per I/O pin.

For 36 I/O pins, 36 * 75mA = 2.7A in a 1ns time interval.

The SyncBurst SRAM has a VDD tolerance of 3.3V +/- 0.165V. Considering some droop from the power bus and a switching
time of 1ns, and allowing a maximum voltage dip (dV) on the SRAM of 0.05V, the choice of bypass capacitor becomes:

C = (I * dt)/dV = (2.7A * 1ns)/0.05V = 54nF per SBSRAM.

Choosing 6 x 10nF allows some margin.

It is better to use reverse ratio capacitors 0508, 0406, or 0204.

Low ESR is also very important.

Use a temperature-stable dielectric such as X7R.

From Vishay: VJ0402 style, X7R.
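The bypass sizing above is two applications of I = C * dV/dt. The following is a minimal sketch of that arithmetic, not part of the original specification; the function names are hypothetical, and the 30 pF load value is the one used in the worked formula.

```c
#include <assert.h>

/* Transient current (A) drawn while a load of c_farads swings through
 * dv_volts in dt_seconds: I = C * dV / dt. */
double transient_current(double c_farads, double dv_volts, double dt_seconds)
{
    assert(dt_seconds > 0.0);
    return c_farads * dv_volts / dt_seconds;
}

/* Bypass capacitance (F) that limits the supply dip to droop_volts while
 * sourcing i_amps for dt_seconds: C = I * dt / dV. */
double bypass_capacitance(double i_amps, double dt_seconds, double droop_volts)
{
    assert(droop_volts > 0.0);
    return i_amps * dt_seconds / droop_volts;
}
```

Calling transient_current(30e-12, 2.5, 1e-9) gives the per-pin current, and passing 36 times that value to bypass_capacitance with a 1 ns interval and 0.05 V dip gives the per-SBSRAM capacitance to compare against the 6 x 10nF choice.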

14.3 Tantalum Capacitors Selection 

Ultra-low ESR tantalum capacitors (T510) are used in the switching power supply, in addition to several bulk storage capacitors
distributed around the PCB that feed the Vcore and Vi/o planes, to enable quick recharging of the bypass chip capacitors.
The number of bulk-storage tantalum capacitors depends on the power supply response time characteristic.

The MPC7400 can go from nap mode to full-on mode power within two cycles.

Icore = (10W - 2W)/1.8V = 4.5A
dt = 10µs

C = (I * dt)/dV = (4.5A * 10µs)/0.05V = 900µF
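As a rough check of the bulk-capacitance figure, the sketch below repeats the calculation in code. It assumes, as the numbers above suggest, that dt is the supply response time and that 0.05V is the allowed dip; the function names are hypothetical. Note that (10W - 2W)/1.8V is about 4.44A, which the text rounds up to 4.5A.

```c
#include <assert.h>

/* Step in core current (A) when dissipation rises from p_nap_w to
 * p_full_w at a core voltage of v_core. */
double core_current_step(double p_full_w, double p_nap_w, double v_core)
{
    assert(v_core > 0.0);
    return (p_full_w - p_nap_w) / v_core;
}

/* Bulk capacitance (F) needed to supply i_amps for dt_s, the time the
 * regulator takes to respond, with at most droop_v of dip. */
double bulk_capacitance(double i_amps, double dt_s, double droop_v)
{
    assert(droop_v > 0.0);
    return i_amps * dt_s / droop_v;
}
```

With the rounded 4.5A step, bulk_capacitance(4.5, 10e-6, 0.05) reproduces the 900µF figure.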






Mercury Computer Systems, Inc.

MEMORANDUM

FROM: Alden Fuchs
ABOUT: Preliminary Framework interface, Memorandum # AF-4
VERSION: V0.2
DATE: 8 December, 2000

DISTRIBUTION

MEMO AF-4
Prototype Framework V0.1



1. Introduction
2.3.1. An Example of the receiving of data from input pin 0
2.4. Send Output
2.4.1. An Example of the sending data on output pin 0
3. Transforms for WCDM Simulation:
3.1. handset (one of n):
3.1.1. Input pins:
3.1.2. Output pins:
3.2. Chan (set of one to m objects):
3.2.1. Input pins:
3.2.2. Output pins:
3.3. broadcast (set of one to k objects):
3.3.1. Input pins:
3.3.2. Output pins:
3.4. Rake (one of n):
3.4.1. Input pins:
3.4.2. Output pins:
3.5. MUX (set of one to L objects):
3.5.1. Input pins:
3.5.2. Output pins:
3.6. MUD (one object for now):
3.6.1. Input pins:
3.6.2. Output pins:
3.7. BER (set of one to m objects):
3.7.1. Input pins:
3.7.2. Output pins:






1. Introduction 

This is a very brief description of the prototype framework and how to use it. The purpose of this 
memo is to describe the software interfaces from within a transform object. 




The above figure depicts the software architecture, and the transform object is a part of the 
Application that is managed by the Application framework. 






WCG Framework 




1.1. Transform Object

The transform object is the basic building block and can be, for example, a Turbo coder, a QAM modulator,
etc.

1.2. Red-Box

The red-box collects transform objects into a logical grouping that describes all of the processing
that will be carried out on a single CPU. (Note: for non-real-time operation, e.g. simulation,
collections of red-boxes can be on a single CPU.)



2. Transform Object Sample 

2.1. Include the following files to define the interface and the variables
required.

#include "mc_error.h"
#include "mcw1.h"
#include "dx_dma.h"
#include "dx_dma_var.h"

2.1.1. The contents of dx_dma_var.h: 

int my_logical_ce;
CONFIG_data *ptr_config_base;
CONFIG_data *ptr_cur_config;
CONFIG_data *ptr_tmp_config;

int active_in_ce[(MAX_CE+1) * MAX_CHAN];
int active_in_ch[(MAX_CE+1) * MAX_CHAN];
int active_in_buf_size[(MAX_CE+1) * MAX_CHAN];
char *active_in_buf[(MAX_CE+1) * MAX_CHAN];
int active_in_index;

int active_out_ce[(MAX_CE+1) * MAX_CHAN];
int active_out_ch[(MAX_CE+1) * MAX_CHAN];
int active_out_buf_size[(MAX_CE+1) * MAX_CHAN];
char *active_out_buf[(MAX_CE+1) * MAX_CHAN];
int active_out_index;

#define dma_send_pin(pin) \
    dma_send( \
        my_logical_ce, \
        active_out_ce[pin], \
        active_out_ch[pin], \
        (char **)&active_out_buf[pin] \
    )

#define dma_rec_pin(pin) \
    dma_rec( \
        active_in_ce[pin], \
        my_logical_ce, \
        active_in_ch[pin], \
        (char **)&active_in_buf[pin] \
    )

2.2. Initialize the interface 

// get config SMB
dma_all_init(
    my_logical_ce,
    active_in_ce,
    active_in_ch,
    active_in_buf_size,
    active_in_buf,
    (int *)&active_in_index,
    active_out_ce,
    active_out_ch,
    active_out_buf_size,
    active_out_buf,
    (int *)&active_out_index,
    (CONFIG_data **)&ptr_config_base
);

ptr_cur_config = &ptr_config_base[my_logical_ce];
#ifdef debug_print
printf("Vir CE %i, module name is %s\n",
    my_logical_ce, ptr_cur_config->module_name);
#endif

ptr_cur_config->state = STATE_RDY; /* all init done, now ready */

// wait for rx to be ready
ptr_tmp_config = &ptr_config_base[active_out_ce[0]];
while (ptr_tmp_config->state != STATE_RDY) // need receiver to be ready
    sched_yield();

// wait for tx to be ready
ptr_tmp_config = &ptr_config_base[active_in_ce[0]];
while (ptr_tmp_config->state != STATE_RDY) // need sender to be ready
    sched_yield();

#ifdef debug_print
printf("\nCE %i, Virtual CE %i, Starting\n", (int)ce_getid(), my_logical_ce);
#endif

2.3. Receive input

Receive input data if required; input pins can be left unused.

2.3.1. An Example of the receiving of data from input pin 0

/* get data from other CE */
rc = dma_rec_pin(0);
ERROR_MCW1(rc);

OR

rc = dma_rec(
    active_in_ce[0],
    my_logical_ce,
    active_in_ch[0],
    (char **)&active_in_buf[0]
);
ERROR_MCW1(rc);

The data is available in the active_in_buf pointer. Note that this always points to the next available
input buffer in the case of multi-buffering. At a later date the size of the input chunk and its offset will be
provided so that a FIFO-like structure can be used.

2.4. Send Output 

Send output data if required; output pins can be left unused.

2.4.1. An Example of the sending data on output pin 0

/* send data to other CE */
rc = dma_send_pin(0);
ERROR_MCW1(rc);

OR

rc = (long)dma_send(
    my_logical_ce,
    active_out_ce[0],
    active_out_ch[0],
    (char **)&active_out_buf[0]
);
ERROR_MCW1(rc);

The data in the active_out_buf pointer will be sent. On return this always points to the next
available output buffer in the case of multi-buffering. At a later date the size of the output chunk and its
offset will be provided so that a FIFO-like structure can be used.
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3. Transforms for WCDM Simulation: 

3.1. handset (one of n):

This object has two input pins and one output pin. It performs the following:

1. Generate transport channel

2. MUX and channel coding

3. Generate TX waveform

4. Simulate RX system for power control etc.

5. Output to the chan model

3.1.1. Input pins:

3.1.1.1. power_control pin 0:

Input to this pin is from output pin 0 of the rake block and is the slot power control.

3.1.1.2. next_chunk pin 1:

Input to this pin is from output pin 1 of the BER block and is the command to send the next n symbols for
processing, e.g. 2 symbols, or a slot, etc.

3.1.1.3. next_chunk pin 1:

Optional input pin, used to provide external access (i.e. from outside of the Generate traffic channel bits step)
to the raw data input; i.e. if we did a codec, the output of the codec would go into this block.

3.1.2. Output pins:

3.1.2.1. signal_out pin 0:

This pin goes to one input pin of the chan object group.

3.1.2.2. raw_bits pin 1:

This pin has the raw data bits as encoded into the data channel so that the BER and BLER
calculations can be done.

3.2. Chan (set of one to m objects): 

Each object in this group has two to n input pins and one output pin. They collectively
perform the following:

1. Channel model for each of the inputs except the carry pin

2. Sum the local signals, and add the carry input pin

3. Output to the front_end object to send the same data to all rake inputs

3.2.1. Input pins:

3.2.1.1. sum_in pin 0:

Input to this pin is from output pin 0 of another channel object. Currently a dummy input is required
on this pin for the process to fire (needs more thought, i.e. a special first chan??).

3.2.1.2. signal_in pin 1 to n:

Input to this pin is from output pin 0 of the handset block.
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3.2.2. Output pins: 

3.2.2.1. signal_out pin 0:

This pin goes to input pin 0 of the front_end object.

3.3. front_end (one object):

This object has one input pin and one output pin. It performs the following:

1. Add receiver distortions and noise

2. Simulate RX system (AGC, A/D, multiple antennas) etc.

3. Output to the broadcast object to send the same data to all rake inputs

Multiple antennas should be treated as separate data streams. The rake receiver will process them
independently, until the MRC stage.

3.3.1. Input pins:

3.3.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the last channel object.

3.3.2. Output pins:

3.3.2.1. signal_out pin 0 to n:

This pin goes to input pin 0 of the broadcast objects.

3.4. broadcast (set of one to k objects):

This object is required to simulate broadcast; until the simple framework supports this feature, we
need this object.

Each object in the group has one input pin and one to n output pins. They collectively perform the following:

1. Take one input and copy it to all of the output pins unmodified

2. Output the same data to all rake input 0 pins.

3.4.1. Input pins:

3.4.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the front_end object.

3.4.2. Output pins:

3.4.2.1. signal_out pin 0 to n:

This pin goes to input pin 0 of the rake objects.

3.5. Rake (one of n):

This object has one input pin and two output pins. It performs the following:

1. AFC

2. Initial signal acquisition and searcher receiver

3. Multiple finger receivers
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4. Channel estimation, MRC etc. 

5. Final data channel despreading.

6. Outputs to:

• MUD group of processes

• Soft-decision symbol processing (FEC decoding and demultiplexing (25.212))



3.5.1. Input pins:

3.5.1.1. signal_in pin 0:

This is the data from the broadcast set, and carries the signals of all the handsets, plus noise etc.

3.5.2. Output pins:

3.5.2.1. power_control pin 0:

This is the slot power control to be sent back to the handset.

3.5.2.2. signal_out pin 1:

This pin goes to one input pin of the MUX object group.

3.6. MUX (set of one to L objects):

This object is required to gather and package information from the 1 to n rake objects. The inputs
are placed into packets(???) or into arrays (???) To Be Determined (TBD). This object should be
morphed into the best approximation of the packaging to be provided by a targeted modem.

Each object in the group has one to n input pins and one output pin. They collectively perform
the following:

1. Package rake information into simulated modem-sourced data.

2. Output to MUD input 0 pin (for now, until MUD integration, there will be a dummy
placeholder block).

3.6.1. Input pins:

3.6.1.1. signal_in pin 0 to L:

Input to this pin is from output pin 1 of a rake object, or from another MUX object's output pin 0.

3.6.2. Output pins:

3.6.2.1. signal_out pin 0:

This pin goes to input pin 0 of the MUD object.

3.7. MUD (one object for now):

This object is a placeholder until a real MUD is implemented.
MUD has one input pin and one output pin.

1. Passes through data and formats it for the BER block

2. Outputs to BER input 0 pin.
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3.7.1. Input pins: 

3.7.1.1. signal_in pin 0:

Input to this pin is from output pin 0 of the MUX object.

3.7.2. Output pins:

3.7.2.1. signal_out pin 0:

This pin goes to input pin 0 of the BER object.

3.8. BER (set of one to m objects):

This object is required to gather and package information from the 1 to n handset objects and the
MUD. The inputs are placed into packets(???) or into arrays (???) To Be Determined (TBD).
This object should be morphed into the best approximation of the packaging to be required by a
targeted modem. It also compares the raw input data and raw received data. It also does the FEC
detection and correction and the block error rate.

Each object in the group has one to n input pins and one to n+1 output pins. They collectively
perform the following:

1. Package rake/MUD information into simulated modem destination data.

2. Perform all of the bit-level processing, interleaving, FEC. This should be in a separate block.

3. BER, BLER etc. BLER should be done via the CRC check, after all symbol decoding is
performed.

4. Output to GUI input 0 pin to display the stats.

5. Output the generate-next-slot command to the one to n handsets.

3.8.1. Input pins:

3.8.1.1. signal_in pin 0 to m:

Input to this pin is from output pin 0 (for now, until MUD is integrated) of the MUD object, or
from output pin 0 of another BER object.

3.8.2. Output pins:

3.8.2.1. stats_out pin 0:

This pin goes to input pin 0 of the host object for display of data on the GUI.

3.8.2.2. next_slot pin 1 (one of n):

This pin goes to input pin 1 of the handset object to indicate the system is ready for the next slot
of data.
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From: Jon Greene <greene @ mc.com> 

To: "Lauginiger, Frank" <fpl@mc.com>, <joates@mc.com>, <afuchs@mc.com>, 

<mvinskus@mc.com> 

Date: 6/23/00 3:05PM 

Subject: Some MUD analysis 

All: 

Obviously, I've been thinking about MUD a lot. Below is some analysis. 

First, some news. We apparently have 400 MHz, 2 meg / 266 MHz L2 Nitros in
house (samples). Vitaly is presently working to bring them up. This is
excellent news. Besides the above speed/size properties, Nitros use
significantly lower power than Max's and allow for varying L2 configuration
options. Nitro L2's can be configured the normal way (as a cache) or all or
half (1 meg) as SRAM memory and can be addressed as such directly. For
example, one can write a buffer into this memory with vmov or, better yet,
as the output of some computation. I'm not sure if it could be the source or
target of a RACEway xfer but we should try to find this out. Even if
configured as a coherent cache, it can be easily locked and unlocked in user
mode. I think configuring as 2 meg of SRAM may work the best for MUD but we
should determine this empirically.

Now, a critical analysis of ops, buffer sizes, bandwidth, access patterns,
algorithm structure and phases of the moon, are all essential to arriving at
a strategy that stands a chance of working. This of course is not easy
because various techniques impact all of the above in unequal ways. Let's
just consider the R1/R1m R-matrix processing on the above Nitro with a
maximum of 100 users. *Without* taking advantage of the diagonal symmetry in
the Corr matrix, which I now believe will be very difficult to do in the
R-matrix ucoded processing loop(s) (we should discuss this), but still
assuming Corr *can* effectively exist as a byte matrix without degrading
accuracy beyond acceptability, a single plane (i.e., a processor's worth) of
the Corr matrix requires 200 * 200 * 32 = 1,280,000 bytes which fits, albeit
uncomfortably, into the L2. At 2 gigabytes/sec (~ 266 * 8), this matrix (if
L2 resident) can theoretically be consumed in 0.64 ms (remember, 1.33 ms. is
our budget). Now, *if* we go with a completely separate X matrix calculation
without stripmining *and* we also store it as byte values, it would require
at most 100 * 100 * 32 = 320,000 bytes. This must be entirely produced and
consumed in the 1.33 ms. time slice. In *theory*, this can be done in 0.32
ms. Finally, the R1_temp output is of size 200 * 200 = 40,000 bytes and can
be produced in .02 ms. So, with the fully separate X matrix approach and no
symmetry in the Corr, we theoretically require ~1,750,000 bytes of buffer
size (I added a little more for stray stuff such as the C vectors and the
phys <=> virt LUTs, etc.) and ~1.0 ms. to produce and consume these buffers.
If we stripmined X, which seems a better way to go, we could hopefully keep
it resident in L1, thereby reducing L2 buffers to ~1,350,000 bytes and 0.7
ms of L2 I/O. The stripmining also allows us the option of keeping the X
strip as shorts rather than bytes.
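The buffer sizes and L2 streaming times quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch for checking the figures, not code from the project; the function names are hypothetical and the 2 gigabytes/sec rate is the rounded figure from the text.

```c
#include <assert.h>

/* Bytes in a rows x cols matrix holding depth one-byte values per entry. */
long matrix_bytes(long rows, long cols, long depth)
{
    assert(rows > 0 && cols > 0 && depth > 0);
    return rows * cols * depth;
}

/* Milliseconds to stream n_bytes once through a memory that sustains
 * bytes_per_sec. */
double stream_ms(long n_bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return 1000.0 * (double)n_bytes / bytes_per_sec;
}
```

For example, stream_ms(matrix_bytes(200, 200, 32), 2e9) corresponds to the 0.64 ms Corr figure, and the X matrix costs two passes of stream_ms(matrix_bytes(100, 100, 32), 2e9) because it is both produced and consumed in the slot.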

Now let's consider the ops count. For the R1/R1m processing (including the
generation of the X matrix and 2 antennas), I come up with (2 * 6 * 100 *
100 * 16 + 4 * 200 * 200 * 16) * 750 = (1,920,000 + 2,560,000) * 750 = 3.36
GOPS. (BTW, if you were wondering, 750 = 1000/1.33.) The R0 processing has
less GOPS due to the symmetry. I get (1,920,000 + 2,560,000/2) * 750 = 2.40
GOPS. Since the R0 and R1/R1m processing use the same X matrix, we may be
tempted to consider having only the R0 processor compute the X matrix and
ship it to the R1/R1m processor. This looks nice from a GOPS perspective (R0
= 2.40, R1/R1m = 1.92) but I'm not sure it will work very well given the
lockstep nature of the processing pipe. For example, will the R1/R1m
processor simply be idle waiting for the X matrix or will it be completing
the *prior* R1_temp processing while the R0 processor is computing the
current X?
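The GOPS figures follow directly from the expression above. The sketch below evaluates it term by term; the constant factors are copied from the expression as written (their individual meanings are not spelled out in the text), and the function names are hypothetical.

```c
#include <assert.h>

/* X-matrix term of the expression: 2 * 6 * n * n * 16 operations per slot. */
long x_matrix_ops(long n_users)
{
    assert(n_users > 0);
    return 2L * 6L * n_users * n_users * 16L;
}

/* R-matrix term of the expression: 4 * dim * dim * 16 operations per slot. */
long r_matrix_ops(long dim)
{
    assert(dim > 0);
    return 4L * dim * dim * 16L;
}

/* Sustained rate in GOPS when ops_per_slot must complete slots_per_sec
 * times every second (750 slots/sec for a 1.33 ms slot). */
double gops(long ops_per_slot, double slots_per_sec)
{
    return (double)ops_per_slot * slots_per_sec / 1e9;
}
```

The R1/R1m load is gops(x_matrix_ops(100) + r_matrix_ops(200), 750.0); halving the R-matrix term for the R0 symmetry gives the lower R0 figure.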



But the real killer about having R0 ship X to R1/R1m is that the X matrix
(320,000 bytes) will take at least 1.23 ms. over RACE++
(320,000/260,000,000). And let's not forget the 40,000 byte R_temp output
matrix that has to also be shipped out in the same time frame. So I don't
think this OPs balancing approach will work.
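A quick way to see why the transfer does not fit is to compare the link time against the 1.33 ms slot. This is an illustrative sketch only; the function names are hypothetical and the 260,000,000 bytes/sec rate is the one used in the division above.

```c
#include <assert.h>

/* Milliseconds needed to move n_bytes over a link sustaining bytes_per_sec. */
double transfer_ms(long n_bytes, double bytes_per_sec)
{
    assert(bytes_per_sec > 0.0);
    return 1000.0 * (double)n_bytes / bytes_per_sec;
}

/* Nonzero when the transfer completes within slot_ms. */
int fits_in_slot(long n_bytes, double bytes_per_sec, double slot_ms)
{
    return transfer_ms(n_bytes, bytes_per_sec) <= slot_ms;
}
```

The X matrix alone takes most of the slot, and adding the 40,000 byte output matrix (360,000 bytes in total) pushes the transfer past the 1.33 ms budget.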

We therefore appear to require 3.36 GOPS out of R1/R1m and we might just not 
even bother with the R0 symmetry since it doesn't buy you very much given 
that mpic needs both R0 and R1/R1m as inputs. In other words, have both 
R-matrix processors run essentially the same code. (Will this work?) 

Now 3.36 GOPS out of one processor is a tall order. We may have to resort to
a more asymmetric division of labor (the R0 processor takes advantage of the
R0 symmetry and also does a portion of R1/R1m). But I'd like to pursue the
more balanced division until we are absolutely sure it won't work.

In this approach, both the R0 and R1/R1m processors independently produce
and consume X in strips. A variant could instead produce and consume a
single "value" (actually 32 shorts) of X in a single ucode primitive that
does both the complex multiplies and the dot products (the MUDder of all
primitives). The former is certainly the easier approach and might get us
all the way there but the latter, if it can be cleverly coded, may perform
better. In all cases, the ops don't change but at least the L2 gets some
breathing room.

In any event, the so-called dot-product loop, whether it's separate or
includes the complex multiply, still remains a difficult piece of code to
fully optimize if we allow the number of virtual to physical users to vary
as MUD (and Dr. Oates) demands. Using a LUT to acquire the index list and
count of virtual users for a given physical user will tend to throttle the 
dot product code due to short vector lengths, funny address calculations, 
and "random" load and store patterns. The load isn't so bad since it's two 
cache lines no matter where it comes from. We may want to reorder Corr 
anyway just to ease the address arithmetic and DST logic. We could also 
simply store in the order we produce and leave it to the mpic processor to 
reorder (poor guy). As for the short vector count, I think this can be 
overcome with a clever primitive that "pauses" as little as possible between 
index lists but this will take some careful design. 

I think we should try for the "balanced" stripmine approach with essentially 
the same two primitives running in each processor. In the absence of 
dissenting views, I will continue modifying the C code to realize this 
structure. I'm still not sure where the Amp/fac_xx multiply(s)/shift(s) 
belong but for now I'll rid them entirely from the R-matrix functions that 
I'm preparing for ucoding. 



- Jon 
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"Kenny , Jamie" <jfk@mc.com> 
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To: Wireless Communications Group 
From: J. H. Oates 

Subject: Channel Estimation
Date: October 20, 2000



1. Introduction

In the conventional RAKE receiver, channel amplitude¹ estimation is required for maximal
ratio combining the RAKE fingers. The BER performance is not strongly dependent on the
accuracy of the channel amplitude estimates. For Multi-User Detection (MUD) the channel
amplitude estimates are used for signal subtraction, and accuracy of the channel
amplitude estimates is more critical. In addition, the channel estimation error is larger
when MUD is used since channel estimation is performed in a higher interference
environment. This report investigates the accuracy of the conventional channel amplitude
estimation techniques under elevated multiple access interference. The effect of channel
amplitude estimation error on MUD efficiency is then assessed. The analysis presented
here is intended to be a first look. There are a number of ways to increase the channel
amplitude estimation accuracy. A few of these are discussed below.

Section 2 presents a model for the received signal and match-filter outputs. The effect of 
channel estimation error on MUD efficiency is addressed in section 3. In section 4 the 
accuracy of the conventional channel amplitude estimates is assessed. In section 5 
improved single-user methods are presented for channel amplitude estimation. Section 6 
presents a multi-user channel amplitude estimation method. Section 7 addresses the 
effect of uncancelled multipath on the MUD efficiency, which is used in section 8 to 
assess the effect of dropping small amplitudes. It is shown that the overall MUD efficiency 
is improved by dropping small amplitudes. Conclusions are drawn in section 9. 

Footnote 1: Amplitudes are complex and hence include magnitude and phase.

2. Signal Model and Matched-Filter Outputs

The baseband received signal can be written

    r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_k[t - mT] b_k[m] + w[t]    (1)

where t is the integer time sample index, T = N N_c is the data bit duration, N = 256 is the
short-code length, N_c is the number of samples per chip, w[t] is receiver noise, and where
\tilde{s}_k[t] is the channel-corrupted signature waveform for virtual user k. For L multipath
components the channel-corrupted signature waveform for virtual user k is modeled as

    \tilde{s}_k[t] = \sum_{p=1}^{L} a_{kp} s_k[t - \tau_{kp}]    (2)

where a kp are the complex multipath amplitudes. Notice that a kp = a lp if k and / are two 
virtual users corresponding to the same physical user. This is due to the fact that the 
signal waveforms of all virtual users corresponding to the same physical user pass 
through the same channel. For multiple antennas a kp is a vector. For dual antennas, for 
example, primary and diversity, 



    a_{kp} = [a_{kp}^{(1)}, a_{kp}^{(2)}]^T    (3)



The waveform s_k[t] is referred to as the signature waveform for the kth virtual user. This
waveform is generated by passing the spreading code sequence c_k[n] through a pulse-
shaping filter g[t]

    s_k[t] = \sum_{r=0}^{N-1} g[t - r N_c] c_k[r]    (4)

where N = 256 and g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine
pulse as opposed to a root-raised-cosine pulse, the received signal r[t] represents the
baseband signal after filtering by the matched chip filter. Note that for spreading factors
less than 256 some of the chips c_k[r] are zero.

Combining Equations (1) through (4) gives 

    r[t] = \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{m} \sum_{r=0}^{N-1}
           g[t - mT - \tau_{kp} - r N_c] c_k[r] a_{kp} b_k[m] + w[t]    (5)

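The output of the despreading operation for a single multipath component is the complex
statistic

    y_{lq}[m] \equiv \frac{1}{2N_l} \sum_{n} r[n N_c + \hat{\tau}_{lq} + mT] \cdot c_l^*[n]

              = \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{m'}
                \left\{ \frac{1}{2N_l} \sum_{n} \sum_{r}
                g[(n - r) N_c + m'T + \hat{\tau}_{lq} - \tau_{kp}] \cdot c_k[r] \cdot c_l^*[n] \right\}
                a_{kp} b_k[m - m'] + w_{lq}[m]

              = \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{m'}
                C_{lkqp}[m'] \cdot a_{kp} \cdot b_k[m - m'] + w_{lq}[m]    (6)

    C_{lkqp}[m'] \equiv \frac{1}{2N_l} \sum_{n} \sum_{r}
                 g[(n - r) N_c + m'T + \hat{\tau}_{lq} - \tau_{kp}] \cdot c_k[r] \cdot c_l^*[n]

    w_{lq}[m] \equiv \frac{1}{2N_l} \sum_{n} w[n N_c + \hat{\tau}_{lq} + mT] \cdot c_l^*[n]

where \hat{\tau}_{lq} is the estimate of \tau_{lq}, and N_l is the (non-zero) length of code c_l[n]. The values
y_{lq}[m] are complex and are referred to as the pre-MRC matched-filter outputs. For multiple
antennas, r[t], w[t], y_{lq}[m] and w_{lq}[m] are column vectors.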

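The matched-filter output is then

    y_l[m] \equiv Re\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^H \cdot y_{lq}[m] \right\}

           = Re\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^H \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{m'}
             C_{lkqp}[m'] \cdot a_{kp} \cdot b_k[m - m']
             + \sum_{q=1}^{L} \hat{a}_{lq}^H \cdot w_{lq}[m] \right\}

           = \sum_{k=1}^{K_v} \sum_{m'} r_{lk}[m'] \cdot b_k[m - m'] + w_l[m]    (7)

    r_{lk}[m'] \equiv Re\left\{ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^H a_{kp} \cdot C_{lkqp}[m'] \right\}

    w_l[m] \equiv Re\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^H \cdot w_{lq}[m] \right\}

where \hat{a}_{lq} is the estimate of a_{lq} and w_l[m] is the match-filtered receiver noise. The terms
for m' \ne 0 result from asynchronous users.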
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3. Effect of Amplitude Estimation Error on MUD Efficiency 

MUD efficiency is defined in terms of the ratio of the intra-cell interference with MUD (I_MUD)
to the intra-cell interference with the Matched Filter (MF), that is, the intra-cell interference
without MUD (I_MF):

    \beta_{MUD} \equiv 1 - \frac{I_{MUD}}{I_{MF}}    (8)

The total interference without MUD is I_MF + J, where J is the inter-cell interference.
Similarly, the total interference with MUD is I_MUD + J. The ratio of inter-cell interference to
intra-cell interference without MUD is denoted f = J/I_MF. The increase in system capacity is
equal to the ratio of the total interference without MUD to the total interference with MUD,
which is (I_MF + J)/(I_MUD + J) = (I_MF + f I_MF)/(I_MUD + f I_MF) = (1 + f)/(1 - \beta_MUD + f).
For f = 0.3 and \beta_MUD = 0.7, MUD increases the system capacity by a factor of
1.3/(1 - 0.7 + 0.3) = 2.2. Hence, if our goal is to double system capacity the MUD efficiency
must be approximately 70% or greater.
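
The capacity relation above is easy to evaluate numerically. The following is a minimal sketch; the function names `capacity_gain` and `efficiency_for_gain` are illustrative and do not appear in the report, and the second function is simply the algebraic inverse of the first.

```python
def capacity_gain(beta_mud: float, f: float) -> float:
    """Ratio of total interference without MUD to total interference with MUD.

    beta_mud: MUD efficiency (fraction of intra-cell interference removed).
    f: ratio of inter-cell to intra-cell interference without MUD.
    """
    return (1.0 + f) / (1.0 - beta_mud + f)


def efficiency_for_gain(target_gain: float, f: float) -> float:
    """MUD efficiency needed to reach a target capacity gain (inverse of the above)."""
    return (1.0 + f) * (1.0 - 1.0 / target_gain)


print(round(capacity_gain(0.7, 0.3), 2))        # 2.17, quoted as 2.2 in the text
print(round(efficiency_for_gain(2.0, 0.3), 2))  # 0.65
```

With f = 0.3 the exact break-even efficiency for doubling capacity is 65%, which is consistent with the "approximately 70%" figure used in the text.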

In the following we estimate the loss in MUD efficiency, 1 - \beta_MUD, due to imperfect channel
estimation. For simplicity of presentation we consider approximately synchronous users.
Recall that in a synchronous system the matched-filter outputs can be expressed as

    y_l = r_{ll} b_l + \sum_{k=1, k \ne l}^{K_v} r_{lk} b_k + w_l    (9)

and that the intra-cell interference is then

    I_{MF} = \sum_{k=1, k \ne l}^{K_v} E\{ r_{lk}^2 \}    (10)



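The effect of channel amplitude errors is that the estimates of the R-matrix elements
(\hat{r}_{lk}) are imperfect, which reduces the interference that is cancelled. When MUD is
employed with imperfect R-matrix estimates the detection statistic is

    y_l^{(MUD)} = y_l - \sum_{k=1, k \ne l}^{K_v} \hat{r}_{lk} b_k

                = r_{ll} b_l + \sum_{k=1, k \ne l}^{K_v} r_{lk} b_k
                  - \sum_{k=1, k \ne l}^{K_v} \hat{r}_{lk} b_k + w_l    (11)

                = r_{ll} b_l + \sum_{k=1, k \ne l}^{K_v} (r_{lk} - \hat{r}_{lk}) b_k + w_l

where for the present case we have assumed that the bit estimates are perfect. With
MUD the intra-cell interference is

    I_{MUD} = \sum_{k \ne l} \sum_{k' \ne l}
              E\{ (r_{lk} - \hat{r}_{lk})(r_{lk'} - \hat{r}_{lk'}) \} E\{ b_k b_{k'} \}
            = \sum_{k \ne l} E\{ (r_{lk} - \hat{r}_{lk})^2 \}    (12)

where the last step uses E\{ b_k b_{k'} \} = \delta_{kk'}.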
Now from Equation (7), specialized for synchronous users 

r lk -J + k c;j 

Hence the second-order statistics are 

I -^t tffe^ <AA Y +£ ,^„c,;, •a» £v c w } 

a -i-^. E ;.2.[l + |p|»] 

a; i 2 }-*K» i 2 } ej, =£{ e ,.„ i 2 }=£{ £jj , r} 
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(13) 



(14) 
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where we have assumed that the amplitude error is independent of the amplitude and we 
have used 



Hi 



(15) 



The second expression is discussed below. We refer to E_k as the error amplitude for the
kth virtual user. The residual interference after MUD IC is



    I_{MUD} = \frac{A^2}{2N_l} [(K - 1) \alpha E_d^2 + K E_c^2] \cdot 2 \cdot [1 + |\rho|^2]
                                                                                        (16)
            \approx \frac{K A^2}{2N_l} [\alpha + \beta_c^2] E_d^2 \cdot 2 \cdot [1 + |\rho|^2]

where all data channels have amplitude A. The error amplitude for the control channels is
denoted E_c and the error amplitude for the data channels is denoted E_d. All data channel
amplitudes are determined by scaling the corresponding control channel amplitudes by
1/\beta_c. Hence E_d = E_c/\beta_c.

Similarly we can show that

    E\{ r_{lk}^2 \} = \frac{1}{2N_l} A_l^2 A_k^2 \cdot 2 \cdot [1 + |\rho|^2]    (17)

so that the matched-filter interference is

    I_{MF} = \frac{A^2}{2N_l} [(K - 1) \alpha A^2 + K \beta_c^2 A^2] \cdot 2 \cdot [1 + |\rho|^2]
                                                                                        (18)
           \approx \frac{K A^2}{2N_l} [\alpha + \beta_c^2] A^2 \cdot 2 \cdot [1 + |\rho|^2]

Finally, the MUD efficiency is

    \beta_{MUD} = 1 - \frac{I_{MUD}}{I_{MF}} = 1 - \left( \frac{E_d}{A} \right)^2    (19)
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4. Conventional Channel Estimation 



The conventional channel amplitude estimate is given by

    \hat{a}_{lq} = \frac{1}{M} \sum_{m=1}^{M} y_{lq}[m] \cdot b_l[m]

                 = \sum_{k=1}^{K_v} \sum_{p=1}^{L} \sum_{m'} C_{lkqp}[m'] \cdot a_{kp}
                   \cdot \frac{1}{M} \sum_{m=1}^{M} b_l[m] \cdot b_k[m - m']
                   + \frac{1}{M} \sum_{m=1}^{M} w_{lq}[m] \cdot b_l[m]    (20)

    \Pi_{lk}[m'] \equiv \frac{1}{M} \sum_{m=1}^{M} b_l[m] \cdot b_k[m - m']    (21)

In the above b_l[m] represent the known pilot bits. (The lth virtual user is implicitly a control
channel.) The number M represents the number of pilot bits used to derive the channel
amplitude estimates. The channel amplitude estimate can be rewritten

K v L 

k=i p=i 

= t h w ■ + it * * ■ % + < 22 > 



P =i *=i 
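
The pilot averaging in Equation (20) can be sketched for the simplest case. The sketch below assumes a single path and no interfering users, so that y[m] = a b[m] + noise; the helper `simulate` and its parameters are hypothetical and are used only to generate test data.

```python
import random


def pilot_average_estimate(y, pilots):
    """Average pre-MRC outputs y[m] against known pilot bits b[m] (each +1 or -1)."""
    if len(y) != len(pilots) or not pilots:
        raise ValueError("y and pilots must be non-empty and the same length")
    return sum(ym * bm for ym, bm in zip(y, pilots)) / len(pilots)


def simulate(amplitude, num_pilots, noise_std, seed=0):
    """Hypothetical single-path test signal: y[m] = a * b[m] + complex Gaussian noise."""
    rng = random.Random(seed)
    pilots = [rng.choice((-1, 1)) for _ in range(num_pilots)]
    y = [amplitude * b + complex(rng.gauss(0, noise_std), rng.gauss(0, noise_std))
         for b in pilots]
    return y, pilots


y, pilots = simulate(amplitude=0.8 + 0.3j, num_pilots=18, noise_std=0.2)
print(pilot_average_estimate(y, pilots))
```

Averaging over more pilot bits M reduces the noise contribution to the estimate, which is why M appears in the denominator of the error expressions that follow.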

It is shown in the appendix that 

^wU =*a^a^K* i 2 U (23) 

Hence the variance of the estimate is 
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• }= i *f ».* i 2 1 • «;* } 

I 2 }=^{l^ l l }=i*f l 2 R +|t | 2 R +Wi -EJ (24) 



The factor p £ simply reflects the fact that the off-diagonal elements are smaller than the 
diagonal elements due to partial correlations p kp between the antenna elements. In the 
Appendix it is also shown that 



r 4 i (25) 

Now combining Equations (24) and (25) gives for the variance of the channel amplitude 
estimate 



k =i*{j» w \ 2 \k +ii £ {t^ P i 2 i< 



TV, L ' MN,t( k 

where we have used A_{lp}^2 = A_l^2/L. The first term represents the variance due to a user's
own multipath interference. This term is small compared to the variance arising from the
total multiple-access interference. For simplicity we incorporate part of this term into the
second term and drop the remainder. The final term represents thermal noise and other-
cell interference. For now we assume that thermal noise is small. The interference arising
from other cells is assumed to be proportional to the same-cell interference, with a
constant of proportionality f = 0.35. With these assumptions we have






(27) 



Notice that the magnitude of the error E_l is approximately the same for all users. Also, the
lth user is implicitly a control channel, and hence N_l = PG = 256. If the K_v virtual users
are all at the highest spreading factor, then in terms of the K = K_v/2 physical users we
have

    E_c^2 = (1 + f) \frac{L}{M N_l} [K \beta_c^2 A^2 + K \alpha A^2]    (28)

where E_c is the magnitude of the channel amplitude error for a control channel, \beta_c is the
relative control channel amplitude, A is the amplitude for the data channels, and where \alpha
is the activity factor for the data channels. Since the channel amplitudes for the data
channels are determined by scaling the amplitude of the corresponding control channel it
is evident that E_d = E_c/\beta_c. Hence,

    \left( \frac{E_d}{A} \right)^2 = (1 + f) \frac{K L}{M \cdot PG}
                                     \left[ 1 + \frac{\alpha}{\beta_c^2} \right]    (29)

Given the parameters

    f        = 0.35
    K        = 128
    L        = 4
    M        = 18
    PG       = 256
    \alpha   = 0.4
    \beta_c  = 0.7333

we get

    \frac{E_d}{A} = \sqrt{ (1 + 0.35) \frac{(128)(4)}{(18)(256)}
                    \left[ 1 + \frac{0.4}{(0.7333)^2} \right] } = 0.51    (30)

The number of pilot bits, M, is taken to be 18, which represents 6 bits per slot, with the
amplitudes averaged over 3 slots. The corresponding MUD efficiency is

    \beta_{MUD} = 1 - \left( \frac{E_d}{A} \right)^2 = 1 - (0.51)^2 = 0.74    (31)
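
The arithmetic of Equations (30) and (31) can be checked directly. This is a minimal sketch; the function names are illustrative, and the closed form is the one implied by the numerical substitution in Equation (30).

```python
import math


def relative_error_conventional(f, K, L, M, PG, alpha, beta_c):
    """E_d / A for pilot-based estimates on the control channel, scaled by 1/beta_c."""
    return math.sqrt((1.0 + f) * (K * L) / (M * PG) * (1.0 + alpha / beta_c ** 2))


def mud_efficiency(relative_error):
    """MUD efficiency 1 - (E_d / A)^2 when amplitude error is the only loss."""
    return 1.0 - relative_error ** 2


ratio = relative_error_conventional(f=0.35, K=128, L=4, M=18, PG=256,
                                    alpha=0.4, beta_c=0.7333)
print(round(ratio, 2), round(mud_efficiency(ratio), 2))  # 0.51 0.74
```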
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5. Improved Channel Amplitude Estimates 

One method for significantly improving the channel amplitude estimates is to perform a
second estimate directly on the data channels after the initial data channel demodulation.
Performance is improved for two reasons. First, the entire slot can be used for integration.
Hence we have M = 3(10) = 30 bits. Secondly, the error is not scaled by 1/\beta_c since the
estimate is performed directly on the data channel. For this method we have

    \frac{E_d}{A} = \sqrt{ (1 + f) \frac{K L}{M \cdot PG} [\beta_c^2 + \alpha] }

                  = \sqrt{ (1 + 0.35) \frac{(128)(4)}{(30)(256)} [(0.7333)^2 + 0.40] }    (32)

                  = 0.29

and the corresponding MUD efficiency is

    \beta_{MUD} = 1 - \left( \frac{E_d}{A} \right)^2 = 1 - (0.29)^2 = 0.92    (33)

Slightly better performance can be achieved by using both data and control channels.
This method can be performed either on the daughter card or on the modem card since it
is a single user method. The assumption is that the matched-filter BER is sufficiently
good.
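
The benefit of the longer integration window can be seen by evaluating Equation (32) for one, two and three slots. This is a sketch under the report's parameters; the function name is illustrative, and 10 bits per slot follows from M = 3(10) = 30.

```python
import math


def relative_error_direct(f, K, L, M, PG, alpha, beta_c):
    """E_d / A when the estimate is formed on the data channel itself (no 1/beta_c scaling)."""
    return math.sqrt((1.0 + f) * (K * L) / (M * PG) * (beta_c ** 2 + alpha))


for slots in (1, 2, 3):
    M = 10 * slots  # 10 bits per slot, as in M = 3(10) = 30
    ratio = relative_error_direct(0.35, 128, 4, M, 256, 0.4, 0.7333)
    print(slots, round(ratio, 2), round(1.0 - ratio ** 2, 2))
```

The three-slot row reproduces the 0.29 and 0.92 figures of Equations (32) and (33).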

6. Multiuser Channel Amplitude Estimation 

Given the conventional channel estimates and the detected user bits it is possible to 
subtract the MAI which corrupts channel estimation. This method of channel estimation is 
referred to as multiuser channel estimation, as opposed to the conventional single-user 
estimation techniques. A simple multiuser channel estimation technique is presented 
below without analysis. Performance should be determined via simulation. 

From Equation (22) the conventional estimate is 

**=E»W a * +w % (34) 

A multiuser estimate is obtained by subtracting the known interference among the channel 
estimates 






(35) 




kptlq 

L kp*lq J VpHq L J 

where the (hopefully) improved multiuser channel estimate is denoted a, g . The first term 
above is the actual channel amplitude. The second term is the residual interference, and 
the last term represents thermal noise and other-cell interference, which is amplified by 
the multiuser interference subtraction. The extent of the amplification needs to be 
determined. 
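
The subtraction step can be illustrated with a small noise-free example. This is only a sketch of the idea, not the report's estimator: the coupling matrix is assumed known and real, the example values are hypothetical, and a single subtraction pass is applied using the conventional estimates in place of the true amplitudes.

```python
def multiuser_refine(estimates, coupling):
    """One pass of interference subtraction among channel estimates.

    estimates: conventional estimates, one complex value per (user, path).
    coupling:  square matrix; coupling[i][j] is the assumed-known leakage of
               amplitude j into estimate i (diagonal entries are ignored).
    """
    n = len(estimates)
    refined = []
    for i in range(n):
        leak = sum(coupling[i][j] * estimates[j] for j in range(n) if j != i)
        refined.append(estimates[i] - leak)
    return refined


true_amps = [1.0 + 0.0j, 0.5j]
coupling = [[1.0, 0.1], [0.1, 1.0]]
# Noise-free conventional estimates: each one is contaminated by the other amplitude.
conventional = [sum(coupling[i][j] * true_amps[j] for j in range(2)) for i in range(2)]
refined = multiuser_refine(conventional, coupling)
for a, c, r in zip(true_amps, conventional, refined):
    print(abs(c - a), abs(r - a))
```

In this example the residual error after subtraction is second order in the coupling. With noise present the subtraction also adds noise from the other estimates, which is the amplification the text says remains to be determined.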

7. Effect of Uncancelled Multipath Interference

It is expected that a typical RAKE receiver will be capable of tracking up to approximately
16 multipath components. Since the computational complexity of symbol-rate MUD is
quadratic in the number of multipaths L it is unlikely that MUD implementations will be
able to cancel all multipath interference. The effect of uncancelled multipath is assessed
below.

Suppose that the RAKE receiver processes L' multipath components, but that the MUD
implementation cancels interference for L < L' components. From Equation (13) we have



r ,k = ^Hfe\' c % P + K a i ' c kp} 



L' V_ 

Z q=\ p=l Z 9=1 P=L+i 

+ ^ f tk^kp -C lkqp +a» p a lq -Cl qp } + \ £ £&X -C lkqp + a»a lq C^} 



= \ X S fo" a k P ■ C ik qP + a kp a ig ■ C kp 1 
^ ?=] p=i 

r , k ~r lk = ^X£fe? e * ' C ^p +£ " P a i q - C l q p} + \lL ' C '%> +fl *"V C ^} 

I q =\ p =\ L q=\ p=L+] 

£ q=L+\ p=\ Z q=L+\ p=L+i 

and the variance is then 



(36) 
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Z ™lq=L+lp=l q=L+lp=L+l 

m i q =l p =] ^M, q=\ p=L+l 

^™ I q=L+l p=l ^i^l q=L+l p=L+l 

t y g=l p= l ?= ] p=L+ J q=L+ ] p=i g =L+ l p=L+ \ J 



(37) 



Note that \beta_{x,k} is the ratio of the uncancelled to cancelled interference for the kth user.
Similarly, we have

E{4 }= 2 ' ^ p |2 \ a? a\ + \fi x k + p xJ + p xJ p Xtk ]A?A 2 k } (38) 

Now, neglecting the second order terms \beta_{x,l} \beta_{x,k} and averaging over the users,
\beta_x = E\{ \beta_{x,l} \}, we arrive at
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I MVD = X^fc-r,*) 2 } 

k=l.k*l 

= 2 '^ P ^ KA 2 {aE 2 d + E 2 c ) + 2P x (aA 2 d +A 2 )} 
= 2 lH p ^ KA 2 {(« + p 2 )E 2 d +2(3 x (a + p 2 )A 2 } 
= 2 ^ N P ^ KA\a + p 2 ){E 2 d+ 2P x A 2 } 



= lE{r,l} 

2N, c 1 J 



h2P x A 2 



ft)' 



(39) 



1 mp A 2 +2P X A 2 1 + 2/3, 

Note that \beta_x is the ratio of the uncancelled to cancelled interference.

In order to assess typical values for \beta_x, multipath models [1][2][3] were used to generate
random profiles. The models are based on data collected in four areas (A, B, C, and D) in
the San Francisco-Oakland bay area. Table 1 below summarizes the key results. The
table shows \beta_x versus the number of multipath components L.





              L = 8     L = 6     L = 4     L = 3     L = 2     L = 1
    Area A    0.0019    0.0064    0.0481    0.0961    0.2376    0.5819
    Area B    0.0012    0.0086    0.0404    0.1115    0.1416    0.5749
    Area C    0.0004    0.0054    0.0291    0.0948    0.1649    0.6603
    Area D    0.0039    0.0128    0.0430    0.0629    0.1435    0.4890



Suppose \beta_x = 0.05 and (E_d/A)^2 = 0.51^2 = 0.260. Without taking uncancelled multipath into
account we found \beta_MUD = 0.74. Taking uncancelled multipath into account we find

    \beta_{MUD} = 1 - \frac{(E_d/A)^2 + 2\beta_x}{1 + 2\beta_x}

                = 1 - \frac{(0.51)^2 + 2(0.05)}{1 + 2(0.05)}    (40)

                = 0.67

where a worst-case \beta_x = 0.05 is used.
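
The sensitivity of Equation (40) to the uncancelled-multipath ratio can be checked with a short sketch. The function name is illustrative; the formula is the one implied by the numerical substitution in Equation (40).

```python
def mud_efficiency_uncancelled(error_ratio_sq, beta_x):
    """MUD efficiency with amplitude error (E_d/A)^2 and uncancelled-to-cancelled ratio beta_x."""
    return 1.0 - (error_ratio_sq + 2.0 * beta_x) / (1.0 + 2.0 * beta_x)


print(round(mud_efficiency_uncancelled(0.51 ** 2, 0.0), 2))   # 0.74
print(round(mud_efficiency_uncancelled(0.51 ** 2, 0.05), 2))  # 0.67
```

Setting beta_x = 0 recovers the result of Equation (31), so the two expressions agree when all multipath is cancelled.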
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8. Improved MUD Efficiency Due to Dropping Small Amplitudes 



If small amplitude multipath components are not included in the cancellation the MUD
efficiency is reduced slightly due to the additional uncancelled multipath interference, but
it is also increased because of the absence of the error resulting from the inclusion of these
small noisy estimates. The net effect is a substantial increase in the MUD efficiency. From
Equation (30) we have

    \frac{E_{d1}^2}{A^2} = (1 + 0.35) \frac{(128)}{(18)(256)}
                           \left[ 1 + \frac{0.4}{(0.7333)^2} \right] = 0.065    (41)

where E_{d1}^2 is the error due to a single multipath (i.e. L = 1). From Equation (37) it is
evident that if a particular multipath amplitude satisfies A_{kp}^2 < E_{d1}^2 then it is advantageous
not to incorporate this amplitude into the cancellation since the error is greater than the
amplitude. Table 2 shows the mean number of paths E{L} which satisfy A_{kp}^2 > E_{d1}^2 and the
ratio \beta_x of the uncancelled to cancelled interference if only these multipaths are cancelled.
The MUD efficiency is then calculated using



(42) 





              E{L}      \beta_x   \beta_MUD
    Area A    2.0300    0.0714    0.7638
    Area B    2.4660    0.0691    0.7482
    Area C    2.2970    0.0680    0.7564
    Area D    2.0690    0.0625    0.7748
    Mean      2.2155    0.0678    0.7608
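
The path-selection rule of this section can be sketched as follows. The profile values are hypothetical and are not taken from the report's multipath models, and the sketch assumes that the interference ratio \beta_x can be taken as the ratio of uncancelled to cancelled path power for a single profile.

```python
def select_paths(path_powers, error_power):
    """Keep paths whose power exceeds the per-path estimation error power.

    Returns (kept_indices, beta_x), where beta_x is the ratio of uncancelled
    to cancelled path power for this profile.
    """
    kept = [i for i, p in enumerate(path_powers) if p > error_power]
    cancelled = sum(path_powers[i] for i in kept)
    uncancelled = sum(path_powers) - cancelled
    if cancelled == 0.0:
        raise ValueError("no path exceeds the error power")
    return kept, uncancelled / cancelled


# Hypothetical 4-path power profile (relative to A^2), not taken from the report.
profile = [0.60, 0.30, 0.06, 0.04]
kept, beta_x = select_paths(profile, error_power=0.065)
print(kept, round(beta_x, 3))  # [0, 1] 0.111
```

Averaging the number of kept paths and the resulting ratio over many random profiles would give quantities comparable to the E{L} and \beta_x columns of Table 2.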



9. Conclusions 

This report represents a first-look at channel estimation and the effect of errors on the 
MUD efficiency. Only the case where all users are at the highest spreading factor has 
been examined. The initial results indicate that if the conventional channel estimates are 
used the MUD efficiency drops to 74% due to estimation errors. If the effect of 
uncancelled multipath interference is also considered the MUD efficiency drops down to 
67%. If small amplitude multipath components are not included in the cancellation the 
MUD efficiency is reduced slightly due to the additional uncancelled multipath 
interference, but it is also increased because of the absence of the error resulting from the
inclusion of these small noisy estimates. The net effect is a substantial increase in the 
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MUD efficiency, which is increased to 76%. The actual MUD efficiency will, of course, be 
less due to other factors which degrade efficiency. If an improved single-user channel 
estimation is used the MUD efficiency can be increased to 92%. This improved method 
requires knowledge of the pre-MRC matched-filter outputs. It is perhaps possible to 
further increase the MUD efficiency by employing multiuser channel estimation. These 
techniques also require knowledge of the pre-MRC matched-filter outputs. The above 
referenced MUD efficiency numbers are based on 128 users processed by the 
basestation. If fewer users are allowed access to the system in order to increase range,
the MUD efficiency is unchanged since the total interference and noise remain
unchanged.
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Appendix A 

In order to estimate the variance of the channel amplitude estimate we need the second 
order statistics 

E{n, qkp ■ *,wu = • m ■ 

where we have used 

£fc„M ■ C w ,[-fl- J- • «„ ■ «„ ■ • (A2) 

which is derived in Appendix B assuming random codes. In order to evaluate E{lf k [m']} 
we consider two cases: 1) k = l, and 2) k \ne l. For k = l we have
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E{ll [iri ]}= X & > " ^ W • [m - m' ] • & ; [n - m 1 ]| 

= ^t££[^o+(1-^,)5 TO! ] 



whereas for k \ne l we have



(A3) 



5 ~M 

Hence, combining Equations (A3) and (A4) we have

Equation (A1) then becomes

™ 4 w ^?^K«-KKl (A6) 

Now specializing Equation (A6) to the case where k = l

E \ H p) .jJVlfi-jAU (AT) 

The above expression is further simplified if we assume that users are approximately
synchronous so that N_{llqp}[0] \approx N_l, which gives
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Similarly, specializing Equation (A6) to the case where k 



Appendix B 

In Appendix A we used the approximation 

4c, kgp [rn].C; kV .[m']h (B1) 

under the restriction that lq \ne kp. We show here that this expression is exactly true for
chip-synchronous users, and that the approximation is reasonably valid for chip-
asynchronous users, particularly when differences in delay lag are greater than about 2
chips. The analysis is based on random user codes.

The user correlations can be explicitly related to the code correlations as follows

    C_{lkqp}[m] = \frac{1}{2N_l} \sum_{i} \sum_{j}
                  g[(i - j) N_c + mT + \tau_{lq} - \tau_{kp}] \cdot c_l^*[i] c_k[j]
                = C_{lk}[\tau_{lkqp}[m]]
                                                                                (B2)
    C_{lk}[\tau] \equiv \frac{1}{2N_l} \sum_{i} \sum_{j} g[(i - j) N_c + \tau] \cdot c_l^*[i] c_k[j]

    \tau_{lkqp}[m] \equiv mT + \tau_{lq} - \tau_{kp}

Consider two cases: 1) l \ne k, and 2) l = k.

Case 1

When l \ne k the second-order statistics become

E{c, k [r] • c;.[t']}= [T]- g,,.[r'] • Efa[i\ ■ c r Vl ■ c k [j] ■ c* k ,[f]} 

= " J 8 1, [T] • g r ,■ [T' ] " 25-.$,,. ■ 2$ tt . $ ,.. 

4N,N r £}*' L ' J " " kk » (B3) 



5a "^Z«,[T]-« g [r'] 




where we have used the assumption of random user codes, independent among the
users. Note also that the summation over i is over the range where c_l[i] is non-zero, and
similarly the summation over j is over the range where c_k[j] is non-zero.

Case 2 

Now consider case 2 where l = k



e{c u [t] ■ c,;.[t-]}= [t]. g,,.rr'] ■ Efcm ■ c r [/•] ■ c ,[ji cim} 

= SnT?i g * [Tl " S - [T ' 3 ' E ^' m ' c,ul ' c,in ' c ' u '^ 



When l \ne l' we have



E{c n [r] ■ c; 4 .[t']}= ^-S^m- - fifcm - • c,.[n- c ;.[/]} 

■S^[T]-g fJ .[T']-25 ff -2ff P/ 



(B4) 



5,, 

= 4iV l JV r ^** l " J 6 ' 7 " J "° "' T (B5) 



= 5 ri .g[T]g[T'] 

whereas when l = l' we have

£{c„m ■ c;.[t']}= ^ r J gj ( [t]- g r/ [r'] • Jsfem-c^ri-c^/i-^.E/]} 

= ^7tI 5>„M •*r J -[T , ]-£t;[n-c l [i-]-c,[j]-c;[/]} 
= T!!fr\L8 v W - g.ytT'] " 25,, - 25 y - Xg,[T]- g,,.[T'] • 25,. • 25 

+ 2 g,j M- g (/ [T' ] • 2ft, [/■ ] • c ; [/ ]}| 
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= ^{Lg i j\T]-g IJ [r']-2-2-^g[r]-g[r']-2-2 



= ^|X*ifM-g tf [T , ]-iV l g[T]-g[T'] + J Y | 2 g[T]-g[T']J 
= S n .g[T] ■ g[f] + |X g . [T] • g, [T'] - iV,g[T] • g[f '] j 



Hence combining Equations (B5) and (B6c) we have 



and combining cases for l \ne k and l = k we have



(B6c) 



EfcjT]-C; t .[T^}=5^ (B7) 



£{cJt]Ot']}=5,^^ 

= g ft •g / . t .g[r]-g[T']+ gft |Xg B [T]-g t [T']-^ /g [T]- g [r']| (B8) 

= g ft -g ; ,.g[T]-g[T']- ^ • V g[T]-g[T'] + ^fcXg,[T]-gjT'] 

The above expression can be used to determine the second-order statistics for the 
general case of symbol-asynchronous and chip-asynchronous users with arbitrary 
spreading factors. In what follows we will be interested in approximating the above 
expression so as to get simple but meaningful results. In order to simplify the expressions 
we consider users all at the highest spreading factor, and we assume that certain small 
values are zero. 

To assess the accuracy of channel estimation we need to determine the second order 
statistics 
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with Iq^kp . The function gtx]gf[T'] in Equation (B8) above is small unless both x and x' are 
close to zero, and for the chip-synchronous case the function is exactly zero unless
both \tau and \tau' are equal to zero. Since for lq \ne kp the probability that \tau_{lkqp}[m] is close to
zero is small, a good approximation is to assume that these functions are zero. The third
term can be written



£{C tt [T].C;,[T']^^|^2^[T].g v [T']| 



(B10) 



The double summation in the brackets 



SjT,T'] S -i-Xg, y [Tj-g,[T'] 



(B11) 



is plotted in Figure B1 for N_l = N_k = 256 versus \tau - \tau' for (\tau + \tau')/2 = 0.

Figure B1. Plot of S_{lk}[\tau, \tau'] for N_l = N_k = 256
versus \tau - \tau' for (\tau + \tau')/2 = 0. (Horizontal axis ticks run from -8 to 8.)

The sharp localization around \tau - \tau' = 0 is valid for all values of (\tau + \tau')/2, except that for
(\tau + \tau')/2 large the peak value drops off due to the partial overlap of the codes. Hence for
delay lag differences \tau - \tau' greater than about 2 chips a good approximation is



S ft [T,T'] = S Tr .-S,jT,T] 



(B12) 



This approximation then gives 



£{cJt]-C ; ;,[t']}e 



(B13) 
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which implies 



E^^m-^wis — S.r-S^-d^ -8 pp . -d mm ,-S !k [r,T] (B14) 
provided the delay spread is less than a symbol period. Now it can be shown that 

N ' ' (B15) 
where N_{lkqp}[m'] is the overlap between the user codes. Our final result is then
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J 1 . Multi-User Signal Model 

The Rake receiver operation described in the next section is based on a signal model. The
MUD algorithm and implementation are based on the same model. This model is
described below.

Figure 1 shows the uplink complex spreading for the Dedicated Physical Data
CHannels (DPDCHs) and the Dedicated Physical Control CHannel (DPCCH). There can
be from 1 to 6 DPDCHs, denoted DPDCH_k, for k from 1 to 6. If there is more than one
DPDCH, then the spreading factor for all DPDCHs must be equal to 4. For a single
DPDCH (DPDCH_1) the spreading factor can vary from 4 to 256. The data bits for channel
DPDCH_1 are spread by channelization code c_{d,1} = C_{ch,SF,SF/4}, where SF is the DPDCH
spreading factor. These channelization codes are referred to as Orthogonal Variable
Spreading Factor (OVSF) codes. They are equivalent to Hadamard codes, except for their
ordering. When there are multiple DPDCHs then dedicated channels DPDCH_k, for k from
1 to 6 are spread by channelization codes c_{d,k} = C_{ch,4,n}, where the relationship between n
and k is represented in Table 1.



Table 1. Relationship between n and k.

    n     k
    1     1, 2
    3     3, 4
    2     5, 6



The data bits for the DPCCH are spread by code c_c = C_{ch,256,0}. The spreading factor for the
DPCCH is always equal to 256. The multipliers \beta_c and \beta_d are constants used to select the
relative amplitudes of the control and data channels. At least one of these constants must
be equal to 1 for any given symbol period m.
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DPDCH, KX) KX)— »► " 



Figure 1. Uplink complex spreading of DPDCHs and DPCCH

The uplink spreading for any one of the seven Dedicated CHannels (DCHs) above can be 
represented as shown in Figure 2. 



Figure 2. A second representation of the uplink spreading
for any one of the seven Dedicated CHannels (DCHs).



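where the code c[n] is given by

    c[n] = \begin{cases}
        C_{ch,256,0}[n]   \cdot j S_{sh}[n], & DPCCH   \\
        C_{ch,256,64}[n]  \cdot S_{sh}[n],   & DPDCH_1 \\
        C_{ch,256,64}[n]  \cdot j S_{sh}[n], & DPDCH_2 \\
        C_{ch,256,192}[n] \cdot S_{sh}[n],   & DPDCH_3 \\
        C_{ch,256,192}[n] \cdot j S_{sh}[n], & DPDCH_4 \\
        C_{ch,256,128}[n] \cdot S_{sh}[n],   & DPDCH_5 \\
        C_{ch,256,128}[n] \cdot j S_{sh}[n], & DPDCH_6
    \end{cases}    (1)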
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and 



DPCCH 
DPDCH, 



(2) 



For a DCH with a spreading factor less than 256 there are J = 256/SF data bits 
transmitted during a single 256-chip symbol period (i.e. 1/15 ms). From a signal model 
perspective, the J data bits transmitted per symbol period can be viewed as arising from J 
virtual users, each transmitting a single bit per symbol period. The idea is illustrated in 
Figure 3. 



    b_0[m] \equiv b[0 + mJ]
    ...
    b_{J-1}[m] \equiv b[J - 1 + mJ]




Figure 3. Transforming a single user with bit rate J bits per symbol period 
into J virtual users, each with bit rate 1 bit per symbol period. 

The codes for these virtual users are formed by extracting SF elements at a time out of 
the DCH code sequence to form J new codes. Each of the J codes is of length 256 chips, 
but with only SF non-zero chips. That is, 



    c_j[n] \equiv \begin{cases}
        c[n], & j \cdot SF \le n < (j + 1) \cdot SF \\
        0,    & \text{otherwise}
    \end{cases}    (3)



This code-partitioning concept is illustrated in Figure 4 for the case SF = 64 so that J = 
256/SF = 4 codes are derived from the one DCH code. 
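
The code partitioning and the mapping of bits onto virtual users can be sketched in a few lines. The function names are illustrative, and the stand-in code below is an arbitrary +/-1 sequence rather than an actual DCH code.

```python
def partition_code(code, sf):
    """Split one DCH code of length 256 into J = 256/sf virtual-user codes.

    Virtual code j keeps chips j*sf .. (j+1)*sf - 1 and is zero elsewhere.
    """
    n = len(code)
    if n % sf:
        raise ValueError("spreading factor must divide the code length")
    return [[c if j * sf <= i < (j + 1) * sf else 0 for i, c in enumerate(code)]
            for j in range(n // sf)]


def virtual_user_bits(bits, J):
    """Virtual user j carries b_j[m] = b[j + m*J]."""
    return [bits[j::J] for j in range(J)]


code = [1 if (i * 7) % 3 else -1 for i in range(256)]  # arbitrary +/-1 stand-in code
virtual = partition_code(code, sf=64)
print(len(virtual), [sum(1 for c in v if c) for v in virtual])  # 4 [64, 64, 64, 64]
print(virtual_user_bits([1, -1, -1, 1, 1, 1, -1, -1], J=4))
```

Because the J virtual codes have disjoint non-zero chips, their chip-wise sum reproduces the original DCH code.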
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s 




Figure 4. Code partitioning concept illustrated for the case SF = 64,
whereby J = 256/SF = 4 codes are derived from a single DCH code.

The control channel can also be viewed as a virtual user. Hence, for a given physical user
with spreading factor SF there are 1 + 256 N_D/SF virtual users, where N_D is the number of
DPDCHs. (Recall that for N_D > 1, SF = 4.)

It turns out to be convenient to use a double indexing scheme to identify virtual users. Let
paired indices kj represent the jth virtual user associated with the kth dedicated channel.
Index j varies over 0 <= j < J_k = 256/SF_k, where SF_k is the spreading factor for the kth
dedicated channel. For the remainder of this section the spreading factors SF_k are
assumed to be constant. In section 3 the equations are reformulated to allow for symbol-
by-symbol changes in the spreading factor.
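
The virtual-user count and the double indexing scheme can be sketched as follows. The function names are illustrative, and dedicated channels are numbered from 1 here as an assumption.

```python
def virtual_user_count(sf, num_dpdch):
    """1 control-channel virtual user plus 256/sf virtual users per DPDCH."""
    if num_dpdch > 1 and sf != 4:
        raise ValueError("with more than one DPDCH the spreading factor must be 4")
    return 1 + (256 * num_dpdch) // sf


def virtual_user_indices(spreading_factors):
    """List the (k, j) index pairs, 0 <= j < J_k = 256 / SF_k, for each dedicated channel k."""
    return [(k, j) for k, sf in enumerate(spreading_factors, start=1)
            for j in range(256 // sf)]


print(virtual_user_count(64, 1))   # 5
print(virtual_user_count(4, 6))    # 385
print(virtual_user_indices([256, 64]))
```

The last call lists one virtual user for a control channel at spreading factor 256 and four for a data channel at spreading factor 64.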

The transmitted signal for virtual user kj can be written

    x_{kj}[t] = \beta_k \sum_{m} v_{kj}[t - mT] b_{kj}[m]    (4)

where t is the integer time sample index, T = N N_c is the data bit duration, N = 256 is the
short-code length, N_c is the number of samples per chip, b_{kj}[m] are the data bits, and
where v_{kj}[t] is the transmit signature waveform for virtual user kj. This waveform is
generated by passing the spread code sequence c_{kj}[n] through a root-raised-cosine pulse-
shaping filter h[t]

    v_{kj}[t] = \sum_{p=0}^{N-1} h[t - p N_c] c_{kj}[p]    (5)

Note that \beta_k = \beta_c if the kjth virtual user corresponds to a control channel. Otherwise \beta_k =
\beta_d.

The total number of virtual users is denoted 
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(6) 



where K_D is the total number of dedicated channels. The baseband received signal after
root-raised-cosine matched-filtering can be written

    r[t] = \sum_{k=1}^{K_D} \sum_{j=0}^{J_k - 1} \sum_{m} \tilde{s}_{kj}[t - mT] b_{kj}[m] + w[t]    (7)

where w[t] is receiver noise with a raised-cosine power spectral density, and where \tilde{s}_{kj}[t]
is the channel-corrupted signature waveform for virtual user kj. For L multipath
components the channel-corrupted signature waveform for virtual user kj is modeled as

    \tilde{s}_{kj}[t] = \sum_{p=1}^{L} a_{kp} s_{kj}[t - \tau_{kp}]    (8)

where a_{kp} are the complex multipath amplitudes. The amplitude ratios \beta_k are incorporated
into the amplitudes a_{kp}. Notice that if k and l are two dedicated channels corresponding to
the same physical user then, aside from scaling by \beta_k and \beta_l, a_{kp} and a_{lp} are equal.
This is due to the fact that the signal waveforms of all virtual users corresponding to the
same physical user pass through the same channel. The waveform s_{kj}[t] is referred to as
the signature waveform for the kjth virtual user. This waveform is generated by passing
the spread code sequence c_{kj}[n] through a raised cosine pulse-shaping filter g[t]

    s_{kj}[t] = \sum_{p=0}^{N-1} g[t - p N_c] c_{kj}[p]    (9)

Note that for spreading factors less than 256 some of the chips c_{kj}[p] are zero.
2. Rake Receiver Operation 

This section describes the operation of a typical Rake receiver. Figure 5 shows a
representation of the received antenna data that is delivered to the Rake receivers of all 
users. 
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Start of 
.buffer 



One symbol (i.e. 256 chips) 



Start of frame for u; 



I- I I I I I I 



Start of frame for u: 



Start of MUD 
processing frame 

Figure 5. Received antenna data delivered to the Rake receivers of all users.

The figure shows the received signals corresponding to users l and k. These signals are
combined in free space so that the receiver gets one composite signal, which we denote
r[t]. The buffer length is assumed to be an integral number of frames in length so that
delay lag values T_{lq} are approximately constant with each new filling of the buffer. For
each finger of each user there is a delay lag value T_{lq} indicating the start of frame for the
qth multipath of the lth user. Lag values T_{lq} are assumed to be constant over a frame, but
are allowed to change from frame to frame in response to the delay locked loop operation
and in response to new searcher-receiver sweeps where new delay lags are found. The
lower case values \tau_{lq} = T_{lq} mod 256 N_c denote the symbol-period offset relative to the start
of an internal symbol period reference clock. Notice that the user spreading factors
change on user frame boundaries. Since users are asynchronous it is impossible to have
a MUD processing frame that corresponds to all user frame boundaries. Hence the MUD
processing frame is matched as closely as possible to the user frame boundaries, but does
not necessarily correspond precisely to any user's frame boundary. Consequently there
will be spreading factor changes that occur during a MUD processing frame. Handling
these mid-frame changes is the subject of section 3 below.

The received signal above, which has been match-filtered to the chip pulse, must next be 
match-filtered by the user code-sequence filter. Since the spreading factor for the 
DPDCHs is not known, the Rake receiver performs an initial 4-chip despreading over all 
DPDCHs. The Fast Hadamard Transformation (FHT) can be used here to reduce the 
number of operations. The detection statistics for the multiple fingers and multiple 
antennas are maximal-ratio combined. Since the DPCCH is always spread with a 
spreading factor of 256 the DPCCH can be entirely despread during each symbol period. 
TFCI bits are extracted each slot from the DPCCH. After an entire frame is processed the 
TFCI is decoded and the spreading factor for that frame is determined. After spreading 
factor determination the final DPDCH despreading is performed. The resulting detection
statistics are denoted here as y_{kj}[m], the matched-filter output for the kjth virtual user for
the mth symbol period. Since there are K_v codes, there are K_v such detection statistics,
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which are collected into a column vector y[m] for the mth symbol period. The matched- 
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filter output ydm], for the /Ah virtual user can be written 

y„[m] = Re- 



(10) 



where is the estimate of a, q , f lq is the estimate of r lq , and N,is the (non-zero) length 
of codes Cn[n] (i.e., the spreading factor for the fth dedicated channel). The intermediate 
result ynjm] represents the despread signal at the oth lag, and is here referred to the pre- 
MRC matched-filter output. When multiple antennas are employed, r[t], yajm] and a lq axe 
column vectors with one complex element per antenna. 

The matched-filter detector estimates the transmitted data bits as $\hat{b}_{li}[m] = \mathrm{sign}\{y_{li}[m]\}$.
Multiuser detection is considered in the next section.



3. Multiuser Detection Equations and Asynchronous Processing

As shown in Figure 5, a MUD processing interval must necessarily be asynchronous with
most users' frame boundaries, since the users are asynchronous. Because of this,
spreading factors will change during a MUD processing frame. When the spreading factor
changes during the processing frame, the MUD equations are modified. These
modifications are considered in this section.

The modem delivers matched-filter data to the MUD function on a frame-by-frame basis.
Let $N_P[r]$ represent the number of physical users accessing the system during frame $r$. For
each frame the following data is received for physical users $p = 1$ to $N_P[r]$ and each
dedicated channel $l$:

• Number of DPDCHs, $N_{D,p}$

• Spreading factor, $SF_l$

• Amplitude ratios $\beta_c$ and $\beta_d$

• Slot format

• Channel amplitude estimates $a_{lq}$

• Channel lag estimates $T_{lq}$

• Matched-filter outputs $f_{li}[m]$ for all DCHs

• Code numbers

• Gap information for compressed mode

Matched-filter outputs $f_{li}[m]$ correspond to the matched-filter outputs $y_{li}[m]$. If the $l$th
dedicated channel is a DPCCH then matched-filter outputs are only received for the TPC,
TFCI and FBI bits. The $f_{li}[m]$ values are mapped to the $y_{li}[m]$ values as described below.
The mapping accounts for the frame offsets between the various users. The amount of 
matched-filter data received per physical user depends on the DPDCH spreading factor. 

For each dedicated channel a symbol offset $m_l$ is determined according to

where div denotes integer division (i.e. with truncation). The symbol offset represents the
fact that the users, and hence the frame data, are asynchronous. The y-data used for
interference cancellation is derived from the frame data using

$$y_{li}[m] = f_{li}[m - m_l] \tag{12}$$

Figure 6 shows an example mapping of user data frames to MUD processing frames. To 
illustrate concepts the frames are each 16 symbol periods long rather than the actual 150 
symbols for WCDMA. The height of the blocks represents the number of virtual users per 
physical user. For physical users 1 and 4 the spreading factor changes in going from data 
frame 1 to data frame 2. As shown in the figure this results in spreading factor changes 
within the MUD processing frame. The MUD function is designed to calculate the
C-matrix once per frame. Hence mid-frame changes to user spreading factors pose a
problem which requires special treatment. It turns out, and will be shown below, that
mid-frame changes to the spreading factor can be accommodated by performing modified
calculations based on the minimum spreading factor over the MUD processing frame.

[Figure 6 panels: "User data frames" (Frame 1, Frame 2) and "MUD processing frames" (Frame 1, Frame 2)]

Figure 6. Mapping of user data frames to MUD processing frames.



First we develop the MUD matrix signal model which allows user spreading factors to 
change on a symbol-by-symbol basis. We then show how we can perform the processing 
based on the minimum user spreading factors over the MUD processing frame. 

Let us reformulate the signal model presented in section 1 so as to allow spreading
factors to change every symbol period. For every DCH $k$, there are $J_k[m]$ virtual users,
where index $m$ is the symbol period index. The number of virtual users $J_k[m]$ is

$$J_k[m] \equiv \frac{256}{SF_k[m]} \tag{13}$$

where $SF_k[m]$ is the spreading factor for the $k$th dedicated channel during the $m$th symbol
period. The signature waveform for the $j$th virtual user of $J_k[m]$ total belonging to the $k$th
DCH over the $m$th symbol period can be written

$$h_{kj,m}[t] = \sum_{p=0}^{N_k[m]-1} g[t - pN_c]\, c_{kj,m}[p] \tag{14}$$

where the codes and hence the signature waveforms now include the symbol-period index
$m$ to account for symbol-by-symbol spreading factor changes. The channel-corrupted
signature waveform is then

$$\tilde{h}_{kj,m}[t] = \sum_{p} a_{kp}\, h_{kj,m}[t - \tau_{kp}] \tag{15}$$

and thus the received signal corresponding to $K_D$ dedicated channels is

$$r[t] = \sum_{m} \sum_{k=1}^{K_D} \sum_{j=0}^{J_k[m]-1} \tilde{h}_{kj,m}[t - mT]\, b_{kj}[m] + w[t] \tag{16}$$

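The MUD matrix signal model proceeds from substituting the received signal $r[t]$ from
Equation (16) into Equation (10) for the matched-filter outputs

$$y_{li}[m] = \sum_{n} \sum_{k=1}^{K_D} \sum_{j=0}^{J_k[n]-1} r_{li,kj}[m,n]\, b_{kj}[n] + \eta_{li}[m] \tag{17}$$

$$r_{li,kj}[m,n] \equiv \sum_{q=1} \sum_{p=1} \mathrm{Re}\{ a^{*}_{lq}\, a_{kp} \cdot C_{li,kj,qp}[m,n] \}$$

$$C_{li,kj,qp}[m,n] \equiv \frac{1}{2N_l[m]} \sum_{r} \sum_{s} g[(r-s)N_c + (m-n)T + \hat{\tau}_{lq} - \tau_{kp}]\; c^{*}_{li,m}[r]\, c_{kj,n}[s]$$

where $\eta_{li}[m]$ is the match-filtered receiver noise and $N_l[m] = SF_l[m]$. The terms for $m' \neq 0$
(with $m' = m - n$) result from asynchronous users.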

The delay lags $\tau_{lq}$ for a given DCH $l$ will under most circumstances be grouped within a
range of from 4 to 8 μs. Under extreme conditions the delay spread will be as high as 20
μs. In any event, let $\tau_l$ represent the mean delay lag $\tau_{lq}$ over index $q$. According to
Equation (10) above, the matched-filter detection statistic $y_{li}[0]$ is the result found by
correlating the received signal starting roughly at delay lag $\tau_l$, where $\tau_l$ is approximately in
the range 0 to $256N_c$. If $\tau_l$ moves significantly outside this range an adjustment in the
symbol period alignment will need to be made to restore $\tau_l$ back to within the desired
range. More will be said about this below. Along the same lines, the detection statistic
$y_{li}[m]$ is the result found by correlating the received signal starting roughly at delay lag
$\tau_l + mT$.



For efficient MUD processing it is important for the C-matrices to be constant over a 10
ms MUD processing frame. We now describe a method which operates on constant
C-matrices. Handling changes to user spreading factors is relegated to the IC portion of the
MUD processing. Let us define

$$J_k \equiv \max_{m} J_k[m] \tag{18}$$

where the maximization is over symbol periods $m$ that contribute to the current MUD
processing frame. This includes not only symbol periods that fall within the MUD
processing frame, but in addition a few symbol periods on either side due to
asynchronous users. Note that the minimum spreading factor for the $k$th DCH is
$SF_k = 256/J_k$. Now define the DCH contraction factor for the $m$th symbol period as

$$C_k[m] \equiv \frac{J_k}{J_k[m]} \tag{19}$$
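As an illustrative sketch (not taken from the memo), the quantities in Equations (13), (18) and (19) can be computed for one DCH as shown below. The function names and the example spreading-factor sequence are assumptions made for illustration, and spreading factors are assumed to be divisors of 256.

```python
def virtual_users(sf):
    """Number of virtual users J = 256 / SF for one DCH in one symbol period (Eq. 13)."""
    if sf <= 0 or 256 % sf != 0:
        raise ValueError("spreading factor must be a positive divisor of 256")
    return 256 // sf


def contraction_factors(sf_per_symbol):
    """Return (J_k, [C_k[m] ...]) for one DCH.

    sf_per_symbol lists SF_k[m] for every symbol period m that contributes
    to the MUD processing frame. J_k is the maximum number of virtual users
    (Eq. 18), and C_k[m] = J_k / J_k[m] is the contraction factor (Eq. 19).
    """
    j_per_symbol = [virtual_users(sf) for sf in sf_per_symbol]
    j_max = max(j_per_symbol)
    return j_max, [j_max // j for j in j_per_symbol]


# Hypothetical frame: the spreading factor changes from 64 to 128 mid-frame.
j_max, c = contraction_factors([64, 64, 128, 128])
min_sf = 256 // j_max   # minimum spreading factor over the frame
```

In this example the minimum spreading factor is 64, so symbol periods spread with SF 128 have a contraction factor of 2: each of their virtual users spans two of the minimum-SF virtual users.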

The DCH codes for a given symbol period can be expressed as a sum of the DCH codes
corresponding to the minimum spreading factor. For the $k$th DCH there are at most $J_k$
virtual users corresponding to the minimum spreading factor. Let the codes for these
users be denoted $c_{kj}[r]$, $0 \le j < J_k$. The codes for the $m$th symbol period, where there
might be fewer virtual users, are denoted $c_{kj,m}[r]$, $0 \le j < J_k[m]$, where

$$c_{kj,m}[r] = \sum_{j'=jC_k[m]}^{(j+1)C_k[m]-1} c_{kj'}[r] \tag{20}$$

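With this result we are now able to represent the MUD signal model in terms of the
C-matrix and R-matrix elements based on the codes corresponding to the minimum DCH
spreading factors. The C-matrix in Equation () above becomes

$$\begin{aligned}
C_{li,kj,qp}[m,n] &\equiv \frac{1}{2N_l[m]} \sum_{r} \sum_{s} g[(r-s)N_c + (m-n)T + \hat{\tau}_{lq} - \tau_{kp}]\; c^{*}_{li,m}[r]\, c_{kj,n}[s] \\
&= \frac{1}{2N_l[m]} \sum_{r} \sum_{s} g[(r-s)N_c + (m-n)T + \hat{\tau}_{lq} - \tau_{kp}] \sum_{i'=iC_l[m]}^{(i+1)C_l[m]-1} c^{*}_{li'}[r] \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} c_{kj'}[s] \\
&= \frac{N_l}{N_l[m]} \sum_{i'=iC_l[m]}^{(i+1)C_l[m]-1} \; \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} \frac{1}{2N_l} \sum_{r} \sum_{s} g[(r-s)N_c + (m-n)T + \hat{\tau}_{lq} - \tau_{kp}]\; c^{*}_{li'}[r]\, c_{kj'}[s] \\
&= \frac{N_l}{N_l[m]} \sum_{i'=iC_l[m]}^{(i+1)C_l[m]-1} \; \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} C_{li',kj',qp}[m-n]
\end{aligned}$$

where $N_l \equiv \min_m N_l[m] = SF_l$. Similarly, the R-matrix becomes

$$\begin{aligned}
r_{li,kj}[m,n] &= \sum_{q=1} \sum_{p=1} \mathrm{Re}\{ a^{*}_{lq}\, a_{kp} \cdot C_{li,kj,qp}[m,n] \} \\
&= \frac{N_l}{N_l[m]} \sum_{i'=iC_l[m]}^{(i+1)C_l[m]-1} \; \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} \sum_{q=1} \sum_{p=1} \mathrm{Re}\{ a^{*}_{lq}\, a_{kp} \cdot C_{li',kj',qp}[m-n] \} \\
&= \frac{N_l}{N_l[m]} \sum_{i'=iC_l[m]}^{(i+1)C_l[m]-1} \; \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} r_{li',kj'}[m-n]
\end{aligned}$$

$$r_{li',kj'}[m-n] \equiv \sum_{q=1} \sum_{p=1} \mathrm{Re}\{ a^{*}_{lq}\, a_{kp} \cdot C_{li',kj',qp}[m-n] \}$$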
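so that the matched-filter outputs become

$$\begin{aligned}
y_{li}[m] &= \sum_{n} \sum_{k=1}^{K_D} \sum_{j=0}^{J_k[n]-1} r_{li,kj}[m,n]\, b_{kj}[n] + \eta_{li}[m] \\
&= \sum_{n} \sum_{k=1}^{K_D} \sum_{j=0}^{J_k[n]-1} \frac{N_l}{N_l[m]} \sum_{i'=iC_l[m]}^{(i+1)C_l[m]-1} \; \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} r_{li',kj'}[m-n]\, b_{kj}[n] + \eta_{li}[m]
\end{aligned} \tag{23}$$

This last equation can be written

$$\begin{aligned}
y_{li}[m] &= \sum_{n} \sum_{k=1}^{K_D} \sum_{j=0}^{J_k[n]-1} \left\{ \frac{N_l}{N_l[m]} \sum_{i'=iC_l[m]}^{(i+1)C_l[m]-1} \; \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} r_{li',kj'}[m-n] \right\} b_{kj}[n] + \eta_{li}[m] \\
&= \frac{N_l}{N_l[m]} \sum_{i'=iC_l[m]}^{(i+1)C_l[m]-1} y_{li'}[m] + \eta_{li}[m]
\end{aligned}$$

$$\begin{aligned}
y_{li'}[m] &\equiv \sum_{n} \sum_{k=1}^{K_D} \sum_{j=0}^{J_k[n]-1} \left\{ \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} r_{li',kj'}[m-n] \right\} b_{kj}[n] \\
&= \sum_{n} \sum_{k=1}^{K_D} \sum_{j=0}^{J_k[n]-1} \; \sum_{j'=jC_k[n]}^{(j+1)C_k[n]-1} r_{li',kj'}[m-n]\, b'_{kj'}[n] \\
&= \sum_{n} \sum_{k=1}^{K_D} \sum_{j'=0}^{J_k-1} r_{li',kj'}[m-n]\, b'_{kj'}[n]
\end{aligned} \tag{24}$$

where we have defined $b'_{kj'}[n] \equiv b_{kj}[n]$ for $jC_k[n] \le j' < (j+1)C_k[n]$. Equation (24) is expressed
entirely in terms of matrix elements corresponding to the minimum spreading factor for the
MUD processing frame.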
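The bit definition used with Equation (24) amounts to repeating each transmitted bit once for every minimum-spreading-factor virtual user it spans. A minimal sketch of that expansion follows; the function name `expand_bits` is hypothetical, and the contraction factor is assumed to be a positive integer.

```python
def expand_bits(bits, contraction):
    """Expand the bits of one DCH for one symbol period onto the
    minimum-spreading-factor grid.

    bits[j] is the bit of virtual user j (0 <= j < J_k[n]); the result holds
    one entry per minimum-SF virtual user j' (0 <= j' < J_k), with
    result[j'] = bits[j] whenever j*C <= j' < (j+1)*C.
    """
    if contraction < 1:
        raise ValueError("contraction factor must be a positive integer")
    return [b for b in bits for _ in range(contraction)]


# Two virtual users at SF 128 mapped onto a four-user (SF 64) grid.
expanded = expand_bits([+1, -1], 2)
```

With this expansion the interference-cancellation sums can run over a fixed number of virtual users per DCH for the whole MUD processing frame, even when the actual spreading factor changes mid-frame.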
To: Wireless Communications Group
From: J. H. Oates
Subject: WCDMA Downlink MUD
Date: February 23, 2001

1. Introduction

Multiuser Detection (MUD) is most often thought of as a technique to improve either 
capacity or coverage for the uplink. A few reasons why MUD is uplink-focussed are 

• Downlink MUD must be performed in the handsets, which are limited in processing 
power 

• Each handset is interested in only one signal 

• In the downlink users are separated by orthogonal codes 

However, there is typically a greater demand for capacity in the downlink. If MUD is only
applied in the uplink the imbalance is even greater. While in the downlink users are
separated by orthogonal codes, because of multipath there is still significant intra-cell
interference. Equalization has been suggested as a means of restoring orthogonality;
however, the computationally attractive linear equalization methods tend to amplify the
other-cell interference and noise.

A reduced-complexity downlink MUD method is described in the next section.
The Fast Hadamard Transform (FHT) is used to reduce complexity. The FHT is used in
both the forward (demodulation) and backward (regeneration) directions.
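For reference, a minimal sketch of an unnormalized fast Hadamard transform in natural (Sylvester) order is given below. It is a generic textbook butterfly, not the memo's implementation; the function name `fht` is an assumption. A length-$2^s$ input takes $s$ butterfly stages, so a 512-point transform takes 9.

```python
def fht(x):
    """Unnormalized fast Hadamard transform, natural (Sylvester) ordering.

    Uses log2(len(x)) butterfly stages of additions and subtractions only.
    """
    a = list(x)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:                      # one pass per stage
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                u, v = a[i], a[i + h]
                a[i], a[i + h] = u + v, u - v
        h *= 2
    return a


# Applying the transform twice returns the input scaled by its length,
# so the same routine serves both demodulation and regeneration.
x = [1, -2, 3, 0, 5, 1, -1, 2]
round_trip = fht(fht(x))
```

Because the Sylvester Hadamard matrix is symmetric and satisfies $H^2 = nI$, the backward direction is the same butterfly followed by a division by $n$.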



2. The Method 

The method proceeds according to the following steps:

• Receive amplitude and delay information from the searcher receiver

• Start with the largest multipath

• Multiply the received signal by the conjugate of the scrambling code (512 chips at a
time)

• Perform the FHT on the result (for multirate users, this is done in stages)

• Determine soft data estimates

• Set user-of-interest data symbols to zero

• Do the same for all multipaths

• Proceed until the end of the slot

• Estimate amplitudes and gain factors

• Diversity combine results and make hard decisions

• Use hard decisions, gain estimates and the FHT to reconstruct the chip sequence $c[n]$ (with
user of interest nulled)

• Multiply $c[n]$ by $c_{sh}[n]$ to form $d[n]$ (with user of interest nulled)

• Use amplitude estimates, delay lag estimates (from the searcher) and the raised-cosine pulse
to construct the chip filter

• Pass $d[n]$ (with user of interest nulled) through the chip filter to reconstruct the interference
signal

• Subtract the interference signal from the received signal

• Demodulate with a conventional RAKE receiver



The WCDMA transmitted signal can be represented as

$$s[t] = \sum_{n} g[t - nN_c]\, d[n]$$

$$d[n] = \Big\{ \sum_{k=1} G_k\, b_k[n \;\mathrm{div}\; N_k] \cdot c_{ch,k}[n] \Big\} \cdot c_{sh}[n] = c[n]\, c_{sh}[n]$$

$$c[n] \equiv \sum_{k=1} G_k\, b_k[n \;\mathrm{div}\; N_k] \cdot c_{ch,k}[n]$$

where $g[t]$ is the raised-cosine pulse¹, $N_c$ is the number of samples per chip, and $d[n]$ is
the composite chip sequence from all users. The received signal is then

$$r[t] = \sum_{q=1}^{L} a_q\, s[t - \tau_q] = \sum_{q=1}^{L} a_q \sum_{n} g[t - \tau_q - nN_c]\, d[n]$$

¹ The chip-matched filter is artificially placed in the transmitter for simplicity of presentation.

The received signal advanced to the delay of interest is

$$\begin{aligned}
r[nN_c + \tau_p] &= \sum_{q=1}^{L} a_q\, s[nN_c + \tau_p - \tau_q] \\
&= \sum_{q=1}^{L} a_q \sum_{m} g[nN_c + \tau_p - \tau_q - mN_c]\, d[m] \\
&= \sum_{q=1}^{L} a_q \sum_{m} g[\tau_p - \tau_q - mN_c]\, d[m + n]
\end{aligned}$$

The received signal multiplied by the conjugate of the scrambling codes is

$$\begin{aligned}
r[nN_c + \tau_p] \cdot c^{*}_{sh}[n] &= \sum_{q=1}^{L} a_q \sum_{m} g[\tau_p - \tau_q - mN_c]\, c[m+n]\, c_{sh}[m+n] \cdot c^{*}_{sh}[n] \\
&= \hat{a}_p \cdot c[n] + w[n]
\end{aligned}$$



This result can now be demultiplexed using the 512 x 512 FHT. Since $512 = 2^9$, the FHT
proceeds in 9 stages. After the first two stages the SF 4 symbols can be extracted.
Similarly, after $k$ stages the SF $2^k$ symbols can be extracted. The amplitudes $a_p$ can be
determined from the embedded pilot symbols, or searcher-receiver estimates can be
used. If embedded pilot symbols are used the measurement $M_{pk}$ of the $p$th multipath of
the $k$th user is in the form

$$M_{pk} = \hat{a}_p G_k \qquad ()$$

which includes the user gain factor. After measurements are taken for all multipaths and
all users for a given slot, the multipath amplitudes and user gains can be separated by
determining the dominant left and right singular vectors of the rank-1 matrix $M_{pk}$ (aside
from an arbitrary scale factor which can be given to either the amplitudes or the gains). Once
the approximate amplitudes $\hat{a}_p$ are known the actual amplitudes $a_p$ are determined by
inverting the diagonally dominant system of equations

$$\hat{a}_p = \sum_{q=1}^{L} g_{pq}\, a_q, \qquad g_{pq} \equiv g[\tau_p - \tau_q]$$
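The separation of multipath amplitudes from user gains can be sketched as follows. This is an illustration only: instead of a full singular value decomposition it uses alternating least squares, which for a rank-1 matrix converges to the same dominant singular pair (it is a form of power iteration). The function name, the starting guess and the choice to give the gain vector unit norm are all assumptions.

```python
def separate_rank1(M, iterations=20):
    """Factor a (near) rank-1 matrix M[p][k] ~= a[p] * g[k].

    Alternates two least-squares updates: a from g, then g from a.
    Assumes M is non-zero and not orthogonal to the all-ones starting guess
    (true when the gains are positive). The arbitrary scale is resolved by
    normalising the gain vector g to unit Euclidean norm.
    """
    rows, cols = len(M), len(M[0])
    g = [1.0 + 0j] * cols
    a = [0j] * rows
    for _ in range(iterations):
        g_energy = sum(abs(v) ** 2 for v in g)
        a = [sum(M[p][k] * g[k].conjugate() for k in range(cols)) / g_energy
             for p in range(rows)]
        a_energy = sum(abs(v) ** 2 for v in a)
        g = [sum(a[p].conjugate() * M[p][k] for p in range(rows)) / a_energy
             for k in range(cols)]
    norm = sum(abs(v) ** 2 for v in g) ** 0.5
    return [v * norm for v in a], [v / norm for v in g]


# Hypothetical measurements: two multipaths, three users.
a_true = [1 + 1j, 0.5j]
gains = [2.0, 1.0, 0.5]
M = [[ap * gk for gk in gains] for ap in a_true]
a_est, g_est = separate_rank1(M)
```

The product `a_est[p] * g_est[k]` reproduces the measurements; only the split of the overall scale between the two factors is arbitrary, as the text notes.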

The chip filter $h[t]$ for reconstructing the interference signal is

$$h[t] = \sum_{q=1}^{L} a_q\, g[t - \tau_q]$$

$$\sum_{n} h[t - nN_c]\, d[n]$$
1 Purpose

The purpose of this memo is to document parts of the discussion we have been
having on how the TI 6414 DSP may connect to the Raceway.

2 Glossary

EMIF - A port on the DSP 6000 series peripheral bus which allows the
connection of memory devices.

SDRAM - In the context of this memo, means the main external memory of the
TI DSP - the one which contains the program and data.

3 Overview

So far, a proposed architecture is that we use the second EMIF (External
Memory Inter-Face) of the TI 6414 DSP to connect to a dual ported RAM.
Raceway transfers actually access the RAM, and then additional processing takes
place on the DSP to move the data to the correct place in SDRAM. In fact, if the
dualport RAM is not large enough to buffer an entire Raceway transfer, then there
will have to be a messaging protocol between the two endpoint DSPs wishing to
exchange messages (because the message will have to be fragmented in order to
not exceed the reserved buffer space).

An additional restriction of this design is that as more Raceway endpoints are
added, the size of the dualport RAM needs to be increased, or the maximum
fragment size needs to shrink, such that the RAM is big enough to contain at least
2*N*P buffers of size F (2*F*N*P bytes in total), where F is the size of the fragment,
N is the number of Raceway endpoints with which this DSP can exchange messages,
P is the number of parallel transfers which can be active on any endpoint at a time,
and the constant 2 represents double buffering so that one buffer can be transferred
to/from the Raceway, while a second buffer can be transferred to the DSP. The
constant becomes 4 if you want to be able to emulate a full duplex connection.
With a 4 node system, this might be 4*8K*4*4 or 512K plus a little extra for
bookkeeping information. This probably means the minimum size is 1M bytes for
the dual port device.
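The sizing rule above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name and argument names are assumptions, and the bookkeeping overhead mentioned in the text is not modelled.

```python
def dual_port_ram_bytes(fragment_bytes, endpoints, parallel_transfers,
                        full_duplex=False):
    """Minimum buffer space in bytes: buffering * F * N * P.

    buffering is 2 for double buffering, or 4 when a full duplex
    connection is to be emulated.
    """
    buffering = 4 if full_duplex else 2
    return buffering * fragment_bytes * endpoints * parallel_transfers


# The memo's example: 4 nodes, 8K fragments, 4 parallel transfers, full duplex.
example = dual_port_ram_bytes(8 * 1024, 4, 4, full_duplex=True)
```

The result is 512K bytes, which is why the next standard device size up, 1M bytes, is quoted as the practical minimum once bookkeeping space is added.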

4 Problem Identification

There are several characteristics of this architecture which could prove
problematic:

4.1 Requirement For A Fragment/Defragment Protocol

Raceway transfers can currently be very long. This architecture would require
a protocol for breaking transfers down into fragments. If the DSP is sourcing a
transfer greater than the fragment size, then it has to either dedicate itself for the
period of the transfer to programming the DMA engine, or it has to respond to
interrupts as each fragment is transferred. In either case, there is a substantial
performance impact above and beyond the normal performance hit due to
memory bandwidth utilization.

If the DSP is on the receiving end of a Raceway transfer, a similar process has
to take place, except that there must be an interrupt to get the attention of the DSP
(polling would not be sufficient in such a case).

Beyond the performance hit such a protocol would impose on the DSP, there is
a major disadvantage in that only endpoints willing to implement this protocol can
exchange data with the DSP. It is, in effect, defining a de facto standard subset of
Raceway. This is a major interoperability issue (you can no longer plug a board of
DSPs into a fabric and have them work as a standard Raceway Adjunct
Processor).



4.2 Requirement For The DSP To Be Running Code

If the DSP is involved in the Raceway transfers, then the DSP must already be
running in order to perform Raceway transfers. This will require that all nodes on
the Raceway be self booting.

4.3 Lower Transfer Rates

Raceway is less efficient with smaller transfer sizes. If the fragment size is kept
small to minimize dual port RAM requirements, then aggregate Raceway transfer
rates will be lower because of less efficient utilization of the fabric.

4.4 It Is Different

By changing the way Raceway works, we initiate a significant departure from
the way all current Mercury systems work. While there are many other possible
architectures which will perform well, it is inherently risky to change a
fundamental model of how our multiprocessors communicate.

5 An Alternative Architecture

It may be possible to implement a different architecture which addresses some
of these shortcomings.

5.1 Architecture Description

The proposed architecture still has approximately the same hardware as the
existing architecture. The changes are in the way that the Raceway transfers move
between SDRAM and the Raceway.

In the proposed architecture, the FPGA connects to both the buffering device
(dual port RAM or FIFO) and the DSP. The connection to the buffering device
(hereafter FIFO) is used to move Raceway data to/from the FIFO.

The second connection is to the DSP Host Port. Dave currently believes this is
a moderately high performance interconnect - on the order of 75 Mbytes per
second. This interconnect could itself be used to move data to/from the DSP. The
host port can access data in the DSP on-chip memory, as well as any of the
peripheral devices, including the SDRAM. However, 75 Mbytes per second is
pretty slow compared to normal Raceway bandwidth, and we think we can do
better.

The 6414 contains a second EMIF which can be attached to the FIFO (this is
similar to what the current architecture proposal intends). The difference in this
proposed architecture is that rather than have the DSP program the DMA engine
to move data between the FIFO and the DSP/SDRAM, we propose that the FPGA
can program the DMA engine directly via the Host Port.

The Host Port is a peripheral like the EMIF and the Serial Ports. The difference
is that the Host Port can master transfers into the DSP datapaths, i.e. it can read
and write any location in the DSP. Because the Host Port can access the DMA
Controller (we think), it can be used to initiate transfers via the DMA engine.

The advantage of this architecture is that Raceway transfers can be initiated
without the cooperation of the DSP. Thus, the DSP does not have to be self
booting. Performance is increased in two ways: the DSP is free to continue to
compute while Raceway transfers take place, and performance on the Raceway is
increased because there is no need to fragment messages.

The internal datapaths of the DSP are flexible enough that we can control
which devices have priority access to memory and datapath. Specifically, we can
choose to give Raceway transfers priority over the CPU, or vice versa.



5.2 Synchronization Issues

There is an issue to be solved in how we match data rates between Raceway
and the DSP. The EMIF looks to the DSP as if it were a memory, thus it is
reasonable for the DSP to assume it can get at the data it needs at any time.
However, if we indeed use a FIFO to buffer data, the implication is that there is a
way to hold off the DSP when we are waiting for the Raceway to empty or fill our
FIFO. A possibility is that the buffer device remains a dual port RAM rather than
a FIFO, and the FPGA actually does a fragment/defragment into the RAM, and
then programs the DMA engine to move that fragment into/out of the DSP. This
starts to look somewhat like the original architecture, except that because the
FPGA performs the frag/defrag, the actual transfers over the Raceway can be
arbitrarily sized (assuming we can throttle the Raceway).

Synchronization remains one of the larger problems to be solved with this
proposed architecture.



5.3 Sample Transfers

In order to illustrate how this architecture would work, two examples are
given. The first example is when the Raceway attempts to read data out of the
DSP memory.

5.3.1 Raceway Reading DSP Memory

In this example, we assume that another DSP is trying to read the SDRAM of
the local DSP.

1) The FPGA detects a Raceway packet arriving, and decodes that it is a read
of address 0x10000 (for instance).

2) The FPGA writes over the Host Port Interface in order to program the
DMA engine. It programs the DMA engine to transfer data starting at
location 0x10000 (a location in the primary EMIF corresponding to a
location in SDRAM) to a location in the secondary EMIF (the buffer
device/FIFO). As data arrives in the buffer device, the FPGA reads the
data out of the buffer device, and moves it onto the Raceway. When the
proper number of bytes have been moved, the DMA engine finishes the
transfer, and the FPGA finishes moving data from the FIFO to the
Raceway.

5.3.2 Raceway Writing DSP Memory

In this example, we assume that another DSP is trying to write to the SDRAM
of the local DSP.

1) The FPGA detects a Raceway packet arriving, and decodes that it is a
write of location 0x20000 (for instance).

2) The FPGA fills some amount of the buffer device with data from the
Raceway, and then:

3) The FPGA writes over the Host Port Interface in order to program the
DMA engine. It programs the DMA engine to transfer data from the buffer
device (secondary EMIF) and to write it to the primary EMIF at address
0x20000.

4) At the end of the transfer, we could either interrupt the DSP to signal that
a Raceway packet has arrived, or we can use the standard Mercury method
of polling a location in the SMB to see whether the transfer has completed
yet.

5.4 Additional Thoughts

1) We need to verify that the Host Port Interface can program the DMA
engine. The documentation on the 6201 clearly states that it can write to
any location in internal memory, and to anywhere on the peripheral bus;
however, the DMA engine/controller is the datapath controller for all that,
so it is always possible that there is a special case which does not allow
writing of the DMA engine/controller registers from HPI. The chance of
this being so is quite remote, but needs to be verified.

2) We need to understand the transfer rates and latencies of the HPI. This
architecture relies on fairly low latency access through the HPI, otherwise
more buffering space would be required, and at some point bandwidth
begins to be affected.

3) We need to understand the limitations of Raceway with respect to
throttling, etc. The best case would be that Raceway can provide data as
fast as the EMIF can take it (so we wouldn't worry about having data
ready when EMIF wanted it), and also for Raceway to be able to be
throttled so that it can take the data at the rate the EMIF can provide it.
The more the reality deviates from this best case scenario, the more extra
logic is required in the FPGA until at some point complexity may prevent
the architecture from being viable.

4) What we currently know about the 6414 is actually educated guesses
based on documentation of earlier DSPs. We are making some
assumptions about how TI will have enhanced their chip.

5) If/when TI ever puts a RapidIO interface on their DSPs, it will almost
certainly look like a high speed HPI, i.e. it will sit on the peripheral bus,
have a separate datapath channel, data coming in will simply flow to the
correct addresses, and outgoing data transfers will happen by
programming the DMA engine to send data to the RapidIO peripheral
address. This proposed architecture looks almost exactly like that, and so
probably will not require major changes to use a RapidIO enhanced DSP.

6) There are probably more thoughts, but this is probably a good start...
6201 Design Options

[Figures: Option 1; Option 2]

Option 1 is the original proposal submitted at the DSP meeting Monday. Option 2 was created during the
meeting.

The main shortfall in Option 1 is the sharing of the EMIF bus between the 6201 and the Raceway
DMA FPGA. During DMA operations over the Raceway, the 6201 will not have access to the EMIF
interface. Any data or instruction fetches from SDRAM will stall. Given the relatively small size of the
internal SRAM, this will impose a significant penalty on the operation of the 6201. Option 1 also requires
the FPGA to take over SDRAM refresh operation when it takes control of the EMIF bus. This passing back
and forth of the refresh task will not be clean.



Option 2 places a bi-directional transceiver between the 6201's EMIF bus and the Raceway
SDRAM. This allows the 6201 to process data and fetch instructions from its local SDRAM without any
interruption while the DMA FPGA is accessing the Raceway SDRAM. The HPI interface is used by the
6201 to program the DMA engine and by the DMA engine to indicate the DMA complete status to the
FPGA. Option 2 also lends itself to a dual 6201 node per Raceway interface. Decode logic controlling
access to the Raceway SDRAM can be designed in a number/combination of ways:

Total access to both 6201s

Separate areas for each 6201

Read but no write to the other 6201's memory space

A separate common area accessible to both for message passing

The ability of one 6201 to go through the transceiver to the other's local SDRAM (not recommended)

For a migration story to the 6414, Option 2 is a better sell. Option 3 shows the 6414 design: the transceiver is
stripped off and the Raceway SDRAM is connected to the second EMIF. The design will go to one DSP per
Raceway due to the increased processing power of the 6414.

[Figure: Option 3]
Subject: An Efficient WCDMA Receiver Design based on the FFT
FileRef: mjv-019-efficient_wcdma_receiver.doc



1. Introduction 

Typical processing:

    Signal is sampled at N samples per chip.
    Despread by
        upsampling chipping sequence by interpolating and using the RRC chip pulse matched filter as an
        interpolation filter
        multiplying digitized receive signal by upsampled and interpolated chip sequence
        accumulating (integrating) results for an entire DPCCH symbol
    Repeat at the early lead and late lag sample offset values to calculate delay locked loop variables
    Sweep the code correlator N*256 lags to determine code synchronization and channel response
    Spreading sequence is 256 chips long
    Typical filter is 12 chips long
    Typical oversampling rate on the receiver is N = 8

Key calculations:

    Interpolation of the spreading code - precomputed and stored
    Correlation process: N*256 CMAC
    Correlation repeated for N*256 + 2 (DLL) times
    Total CMACs: N*256 * (N*256 + 2) = N^2*65536 + 512*N
    For N = 8, this results in: 4,198,400 CMAC
    1 CMAC = 4 RMUL + 2 RADD = 6 ROP
    Results in 25,190,400 real operations
    At 15000 Hz symbol rate, need: 378 GOP/s
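The operation count above can be reproduced with the short calculation below. It is only a check of the memo's arithmetic; the function and parameter names are assumptions.

```python
def traditional_ops(oversample, chips=256, symbol_rate_hz=15000, rop_per_cmac=6):
    """Operation count for the sweep-correlator receiver described above.

    Each correlation costs N*256 complex multiply-accumulates, and it is
    repeated for N*256 lags plus 2 extra offsets for the delay locked loop.
    """
    lags = oversample * chips
    cmac = lags * (lags + 2)
    rop = rop_per_cmac * cmac
    return cmac, rop, rop * symbol_rate_hz


cmac, rop, rop_per_second = traditional_ops(8)
```

For N = 8 this gives 4,198,400 CMAC and 25,190,400 real operations per symbol, or about 378e9 real operations per second at the 15 kHz symbol rate.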
2. A New Design

Use of FFT to perform efficient circular convolution of spreading code sequence

Results in:

    Short code synchronization (chip sync only, not slot or frame)
    DPCCH demodulation
    Early and late Delay Locked Loop variables
    Rough channel estimate values for an entire symbol worth of differential delay

Polyphase signal processing:

    Digitize the signal at an Nx oversample rate and filter with the RRC filter and split into N streams at
    the 1x rate.
    Compute the complex conjugate of the FT of the spreading code sequence at the chip rate -
    precomputed and stored

Computation:

    Filter data at Nx oversample rate and split into N streams at 1x rate
    For each stream,
        Compute 256-point FFT
        Complex multiply FFT with stored FFT values of spreading code
        Inverse 256-point FFT
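The per-stream computation (FFT, multiply by the stored conjugate spectrum of the code, inverse FFT) yields the circular correlation of the stream with the code at every chip lag at once. The sketch below illustrates this on a toy 8-chip code. It uses a simple recursive radix-2 FFT in place of the 256-point radix-4 FFT mentioned in the memo, and the code sequence and function names are assumptions.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def circular_correlate(stream, code):
    """corr[l] = sum_n stream[(n + l) mod N] * conj(code[n]) for every lag l."""
    n = len(stream)
    code_spectrum_conj = [c.conjugate() for c in fft(code)]  # precomputable
    product = [s * c for s, c in zip(fft(stream), code_spectrum_conj)]
    return [v / n for v in fft(product, inverse=True)]


# Toy example: the received stream is the code delayed by 3 chips.
code = [1, -1, 1, 1, -1, -1, 1, -1]
stream = [code[(n - 3) % 8] for n in range(8)]
corr = circular_correlate(stream, code)
peak_lag = max(range(8), key=lambda l: abs(corr[l]))
```

The correlation peak appears at the lag equal to the delay, which is the sense in which one FFT pass per stream replaces a sweep of the code correlator over all lags.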

Ops calculation:

    Input filter: could be done using FFT as well,
    but for time domain processing: 8*256 points, filter length 96 =>
        96 RMUL per point, 95 RADD per point
        Total of 196,608 RMUL, 194,560 RADD per symbol => 391,168 ROP per symbol
        I and Q streams => 782,336 ROP

    Stream processing (8 streams)
        Radix 4 FFT: 256*4*(4 CMUL + 8 CADD) = 34,816 ROP
        256 CMUL = 1536 ROP
        Radix 4 IFFT: 256*4*(4 CMUL + 8 CADD) = 34,816 ROP
        TOTAL per stream: 71,168 ROP
        Total stream calcs: 569,344 ROP

    Total ops per second at 15000 Hz symbol rate is: 20.3 GOPS,
    more than 18 times more efficient than the traditional approach.

Also, the DLL circuitry can be eliminated since the entire channel response is calculated at the 
symbol rate. 

FFT numbers may be off by a factor of 2 larger in the number of complex multiplications needed. 
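The totals in the ops calculation can be re-derived as follows. This is a check of the stated arithmetic only, with assumed names. One observation from the check: with 6 real operations per complex multiply and 2 per complex add, the stated 34,816 ROP per FFT corresponds to 3 complex multiplies per term rather than 4 (4 would give 40,960), so the sketch exposes that count as a parameter.

```python
ROP_PER_CMUL = 6   # 4 real multiplies + 2 real adds, as counted in the memo
ROP_PER_CADD = 2   # one real add each for the real and imaginary parts


def fft_receiver_ops(streams=8, fft_len=256, filter_taps=96,
                     symbol_rate_hz=15000, cmul_per_term=3):
    """Real-operation counts per symbol for the FFT-based receiver."""
    points = streams * fft_len
    # Time-domain input filter, applied to both the I and the Q stream.
    filter_rop = 2 * points * (filter_taps + (filter_taps - 1))
    # Forward or inverse transform, using the memo's 256*4*(...) form.
    fft_rop = fft_len * 4 * (cmul_per_term * ROP_PER_CMUL + 8 * ROP_PER_CADD)
    per_stream = 2 * fft_rop + fft_len * ROP_PER_CMUL
    per_symbol = filter_rop + streams * per_stream
    return {
        "filter": filter_rop,
        "fft": fft_rop,
        "per_stream": per_stream,
        "per_symbol": per_symbol,
        "per_second": per_symbol * symbol_rate_hz,
    }


ops = fft_receiver_ops()
speedup = 377856000000 / ops["per_second"]   # versus the traditional count
```

The per-second total is about 20.3e9 real operations, and the ratio to the traditional approach comes out between 18 and 19, in line with the figure quoted above.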
Practical Implementation of an Iterative Hard-Decision MUD

Tel: 978-256-0052 x1659
FAX: 978-256-8596

Introduction



Multi-User Detection (MUD) has been shown to provide a number of significant benefits [1][2]. These
include increased system capacity, increased range, enhanced Quality of Service (QoS), improved near-far
resistance, extended battery life, and reduced handset transmit power. This paper describes the practical
implementation of Multi-User Detection (MUD) for the UMTS uplink using short codes. The focus is on
practical implementation details such as efficient implementation of the calculations, processing
requirements, latencies, MUD efficiency, and mapping to hardware.

The use of short codes allows MUD to be performed at the symbol rate. As such MUD can be introduced
into a conventional Base-Transceiver-Station (BTS) as an enhancement to the Matched-Filter (MF) RAKE
receiver. The MUD processing takes the MF detection statistics, performs interference cancellation, and
then delivers improved hard or soft-decision symbol estimates to the symbol-rate BTS processing
functions. The MUD processing introduces only a few milliseconds latency. Because of the reduced
computational complexity of MUD operating at the symbol rate the entire MUD functionality can be
implemented in software on a single card or daughter card populated with a minimal number of processors.
We present here an implementation of an iterative hard-decision Interference Cancellation (IC) algorithm
on four Power PC 7410 processors. The processors are connected together with a high-bandwidth RACE++
interconnect fabric.



In order to perform MUD at the symbol rate the correlation between the user channel-corrupted signature 
waveforms must be calculated. These correlations are stored as elements of matrices, here referred to as the 
R-matrices. Since the channel is continually changing these correlations must be updated in real time. 
There are two elements to updating the R-matrices. The first part is based on the user code correlations. 
These depend on the relative lag between the various user multipath components. It is assumed that these 
lags change with a time constant of about 400 ms. The second part is due to the fast variation of the 
Rayleigh-fading multipath amplitudes. It is assumed that these amplitudes are changing with a time 
constant of about 1.33 ms. The R-matrices are used to cancel the multiple access interference through the 
Multi-stage Decision-Feedback Interference Cancellation (MDFIC) technique. 



UMTS Uplink Multi-rate Signal Model and RAKE Processing 

We derive here the equations describing the MF outputs based on the WCDMA transmitted waveform. 
The users accessing the system will hereafter be referred to as physical users. Each physical user is 
regarded as a composition of virtual users. Each virtual user transmits a single bit per symbol period, where 
by symbol period we mean a time duration of 256 chips (i.e. 1/15 ms). The number of virtual users, then, 
for a given physical user is equal to the number of bits transmitted in a symbol period. At a minimum each 
active physical user is composed of two virtual users, one for the Dedicated Physical Control Channel 
(DPCCH) [3] and one for the Dedicated Physical Data CHannel (DPDCH). If the physical user is a data 
user with Spreading Factor (SF) less than 256 then there are J = 256/SF data bits and one control bit 
transmitted per symbol period. Hence for the rth physical user with data-channel spreading factor $SF_r$, there 
are a total of $1 + 256/SF_r$ virtual users. The total number of virtual users is denoted 

$$K_v \equiv \sum_{r=1}^{K} \left( 1 + \frac{256}{SF_r} \right)$$

where K is the number of physical users. 



The transmitted waveform for the rth physical user can be written as 

$$x_r[t] = \sum_{k} \sum_{m} \beta_k \, s_k[t - mT] \, b_k[m]$$

$$s_k[t] \equiv \sum_{p=0}^{N-1} h[t - pN_c] \, c_k[p] \tag{2}$$

where the sum over k runs over the virtual users of the rth physical user, $b_k[m]$ is the bit of the kth 
virtual user in symbol period m, t is the integer time sample index, $T = N N_c$ is the data bit duration, N = 256 is the short-code length, 
$N_c$ is the number of samples per chip, and where $\beta_k = \beta_c$ if the kth virtual user is a control channel and $\beta_k = \beta_d$ 
if the kth virtual user is a data channel. The multipliers $\beta_c$ and $\beta_d$ are constants used to select the relative 
amplitudes of the control and data channels. At least one of these constants must be equal to 1 for any given 
symbol period m. The waveform $s_k[t]$ is referred to as the transmitted signature waveform for the kth 
virtual user. This waveform is generated by passing the spread code sequence $c_k[n]$ through a root-raised-cosine 
pulse shaping filter h[t]. If the kth virtual user corresponds to a data user with spreading factor less 
than 256 then the code $c_k[n]$ still has length 256, but only $N_k$ of the 256 elements are non-zero, where $N_k$ is 
the spreading factor for the kth virtual user. The non-zero values are extracted from the code $C_{ch,256,64} \cdot S_{sh}[n]$ [3]. 
The W-CDMA standard actually allows for up to six DPDCHs to be multiplexed with a single 
DPCCH. This functionality is not presently incorporated in the MUD algorithms described below. 
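
The virtual-user bookkeeping described above (one control bit plus 256/SF data bits per physical user) can be sketched as follows. The function name and the list-of-spreading-factors input are illustrative assumptions, not part of the original text.

```python
def virtual_users(spreading_factors, code_len=256):
    """Total virtual users K_v: 1 control bit + code_len/SF data bits per physical user."""
    total = 0
    for sf in spreading_factors:
        if code_len % sf != 0:
            raise ValueError("spreading factor must divide the short-code length")
        total += 1 + code_len // sf
    return total


if __name__ == "__main__":
    # 100 low-rate (SF = 256) users, as in the complexity estimates that follow
    print(virtual_users([256] * 100))
```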

The baseband received signal can be written 

$$r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_k[t - mT] \, b_k[m] + w[t]$$

$$\tilde{s}_k[t] \equiv \sum_{q=1}^{L} a_{kq} \, s_k[t - \tau_{kq}] \tag{3}$$

where w[t] is receiver noise, $\tilde{s}_k[t]$ is the channel-corrupted signature waveform for virtual user k, L is the 
number of multipath components, $a_{kq}$ are the complex multipath amplitudes, and $\tau_{kq}$ are the corresponding 
multipath lags. The amplitude ratios $\beta_k$ are incorporated into the amplitudes $a_{kq}$. Notice that if k and l are two virtual users corresponding to the 
same physical user then, aside from scaling by $\beta_k$ and $\beta_l$, $a_{kq}$ and $a_{lq}$ are equal. This is due to the fact 
that the signal waveforms of all virtual users corresponding to the same physical user pass through the same 
channel. The waveform $s_k[t]$ is now the received signature waveform for the kth virtual user. This 
waveform is identical to the transmitted signature waveform given in Equation (2) except that the root-raised-cosine 
pulse h[t] is replaced with the raised-cosine pulse g[t]. 

Thus far the received signal has been match-filtered to the chip pulse. It must next be match-filtered by the 
user code-sequence filter. The resulting detection statistic is denoted here as $y_k$, the matched-filter output 
for the kth virtual user. Since there are $K_v$ codes, there are $K_v$ such detection statistics, which are collected 
into a column vector y[m] for the mth symbol period. The matched-filter output $y_l[m]$ for the lth virtual 
user can be written 

$$y_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n} r[nN_c + \hat{\tau}_{lq} + mT] \cdot c_l^{*}[n] \right\} \tag{4}$$

where $\hat{a}_{lq}$ is the estimate of $a_{lq}$, $\hat{\tau}_{lq}$ is the estimate of $\tau_{lq}$, $N_l$ is the (non-zero) length of code $c_l[n]$, and 
$\eta_l[m]$ is the match-filtered receiver noise. Substituting r[t] from Equation (3) above gives 

$$y_l[m] = \sum_{m'} \sum_{k=1}^{K_v} r_{lk}[m'] \, b_k[m - m'] + \eta_l[m]$$

$$r_{lk}[m'] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n} \tilde{s}_k[nN_c + \hat{\tau}_{lq} + m'T] \cdot c_l^{*}[n] \right\}$$

$$= \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ \hat{a}_{lq}^{*} \, a_{kq'} \cdot \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \tau_{kq'}] \cdot c_l^{*}[n] \right\} \tag{5}$$

The terms for $m' \neq 0$ result from asynchronous users. 



MUD Algorithm and Functions 



A vast number of MUD algorithms have been proposed [1][2]. Many of these are too computationally 
complex to be implemented with current technology. The linear-iterative class of MUD algorithms 
[4][5][6] is the least computationally complex. For this class of algorithms software implementation is 
feasible. The hard-decision variants of these algorithms also enjoy a significant performance advantage in 
that they do not tend to amplify other-cell interference. The downside is that performance degrades under 
high input BER. Since channel decoding reduces the BER by orders of magnitude, it is possible to be 
operating with raw channel BERs as high as 10%. A number of methods have been proposed to address this 
issue, including the null-zone detector [4], and partial interference cancellation [4][5][6]. We employ 
partial interference cancellation in conjunction with a new thresholding technique which reduces 
computational complexity. Our method provides excellent performance under high input BER. 

The implementation of MUD at the symbol rate can be divided into two functions. The first function is the 
calculation of the R-matrix elements. The second function is interference cancellation, which relies on 
knowledge of the R-matrix elements. The calculation of these elements and the computational complexity 
are described in the following section. Computational complexity is expressed in Giga Operations Per 
Second (GOPS). The subsequent section describes the MUD IC function. The method of interference 
cancellation employed is Multistage Decision Feedback IC (MDFIC)[2][7]. 



R-matrix 



From Equation (5) above, the R-matrix calculations can be divided into three separate calculations, each 
with an associated time constant for real-time operation, as follows 
$$r_{lk}[m'] = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_{lq}^{*} \, a_{kq'} \cdot \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \tau_{lq} - \tau_{kq'}] \, c_k[p] \cdot c_l^{*}[n] \right\}$$

$$= \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_{lq}^{*} \, a_{kq'} \cdot C_{lkqq'}[m'] \right\} \tag{6}$$

$$C_{lkqq'}[m'] \equiv \frac{1}{2N_l} \sum_{m} g[mN_c + m'T + \tau_{lq} - \tau_{kq'}] \, \Gamma_{lk}[m], \qquad \Gamma_{lk}[m] \equiv \sum_{n} c_k[n - m] \cdot c_l^{*}[n]$$

where we have omitted the hats indicating parameter estimates. Hence we must calculate the R-matrices, 
which depend on the C-matrices ($C_{lkqq'}[m']$), which depend on the $\Gamma$-matrix ($\Gamma_{lk}[m]$). The $\Gamma$-matrix has the 
slowest time constant. This matrix represents the user code correlations for all values of offset m. For the 
case of 100 voice users the total memory requirement is 21 MB based on two bytes (real and imaginary 
parts) per element. This matrix is updated only when new codes (new users) are added to the system. Hence 
this is essentially a static matrix. The computational requirements are negligible. The most efficient method 
of calculation depends on the non-zero length of the codes. For high data-rate users the non-zero length of 
the codes is only 4 chips. For these codes a direct convolution is the most efficient method to 
calculate the elements. For low data-rate users it is more efficient to calculate the elements using the 
FFT to perform the convolutions in the frequency domain. 

The C-matrix is calculated from the $\Gamma$-matrix. These elements must be calculated whenever a user's delay 
lag changes. For now assume that on average each multipath component changes every 400 ms. The length 
of the g[] function is 48 samples. Since we are oversampling by 4, there are 12 multiply-accumulations 
(real x complex) to be performed per element, or 48 operations per element. When there are 100 low-rate 
users on the system (200 virtual users) and a single multipath lag (of 4) changes for one user, a total of 
$(1.5)(2) K_v L N_v$ elements must be calculated. The factor of 1.5 comes from the 3 C-matrices (m' = -1, 0, 1), 
reduced by a factor of 2 due to a conjugate symmetry condition. The factor of 2 results because both rows 
and columns must be updated. The factor $N_v$ is the number of virtual users per physical user, which for the 
lowest rate users is $N_v$ = 2. In total then this amounts to 230,400 operations per multipath component per 
physical user. Assuming 100 physical users with 4 multipath components per user, each changing once per 
400 ms, gives 230 MOPS. 
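
The C-matrix update budget above can be checked with a few lines of arithmetic. This is only a sketch of the counting argument in the text; the function name and parameter names are illustrative, and the defaults are the values quoted above.

```python
def c_matrix_update_cost(physical_users=100, paths=4, virtual_per_user=2,
                         g_len=48, oversample=4, lag_period_s=0.4):
    """Return (operations per lag change, sustained MOPS) for C-matrix updates."""
    k_v = physical_users * virtual_per_user
    macs_per_element = g_len // oversample       # 12 real-by-complex MACs
    ops_per_element = 4 * macs_per_element       # 2 multiplies + 2 adds per MAC
    # 1.5 (three offsets, halved by symmetry) * 2 (rows and columns) * K_v * L * N_v
    elements = int(1.5 * 2 * k_v * paths * virtual_per_user)
    ops_per_change = elements * ops_per_element
    changes_per_second = physical_users * paths / lag_period_s
    return ops_per_change, ops_per_change * changes_per_second / 1e6


if __name__ == "__main__":
    print(c_matrix_update_cost())
```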



The R-matrices are calculated from the C-matrices. From Equation (6) above the R-matrix elements are 

$$r_{lk}[m'] = \sum_{q=1}^{L} \sum_{q'=1}^{L} \mathrm{Re}\left\{ a_{lq}^{*} \cdot C_{lkqq'}[m'] \cdot a_{kq'} \right\} = \mathrm{Re}\left\{ a_l^{H} \cdot C_{lk}[m'] \cdot a_k \right\} \tag{7}$$

where $a_k$ are L x 1 vectors, and $C_{lk}[m']$ are L x L matrices. The rate at which these calculations must be 
performed depends on the velocity of the users. The selected update period is 1.33 ms. If the update rate is too 
slow, such that the estimated R-matrix values deviate significantly from the actual R-matrix values, then 
there is a degradation in the MUD efficiency. Figure 1 below shows the degradation in MUD efficiency 
versus user velocity for an update period of 1.33 ms, which corresponds to two WCDMA time slots. The plot 
indicates that there is high MUD efficiency for users with velocity less than about 100 km/hr. The plot also 
indicates that the interference corresponding to fast users is not cancelled as effectively as the interference 
due to slow users. For a system with a mix of fast and slow users the resulting MUD efficiency is an average 
of the MUD efficiency for the various user velocities. 
Figure 1. MUD efficiency versus user velocity in km/hr 

From Equation (7) the R-matrix elements can be calculated in terms of an X-matrix 
which represents amplitude-amplitude multiplies 

$$r_{lk}[m'] = \mathrm{Re}\left\{ \mathrm{tr}\left[ a_l^{H} \cdot C_{lk}[m'] \cdot a_k \right] \right\} = \mathrm{Re}\left\{ \mathrm{tr}\left[ C_{lk}[m'] \cdot a_k \cdot a_l^{H} \right] \right\} = \mathrm{Re}\left\{ \mathrm{tr}\left[ C_{lk}[m'] \cdot X_{lk} \right] \right\}$$

$$= \mathrm{tr}\left[ C_{lk}^{R}[m'] \cdot X_{lk}^{R} \right] - \mathrm{tr}\left[ C_{lk}^{I}[m'] \cdot X_{lk}^{I} \right] \tag{8}$$

$$X_{lk} \equiv a_k \cdot a_l^{H} = X_{lk}^{R} + j X_{lk}^{I}$$

$$C_{lk}[m'] = C_{lk}^{R}[m'] + j C_{lk}^{I}[m']$$

The advantage of this approach is that the X-matrix multiplies can be reused for all virtual users associated 
with a physical user and for all m' (i.e. m' = 0, 1). Hence these calculations are negligible when amortized. 
The remaining calculations can be expressed as a single real dot product of length $2L^2$ = 32. The 
calculations are performed in 16-bit fixed-point math. The total number of operations is thus $1.5(4)(K_v L)^2$ = 3.84 
Mops. The processing requirement is then 2.90 GOPS. The X-matrix multiplies when amortized amount to 
an additional 0.7 GOPS. The total processing requirement is then 3.60 GOPS. 
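
The trace identity behind the X-matrix approach can be checked numerically. The sketch below compares the direct quadratic form of Equation (7) with the real dot product of length $2L^2$; the function names and the small test values are illustrative assumptions, and floating point is used here rather than the 16-bit fixed-point math of the actual implementation.

```python
def r_element_direct(a_l, c_lk, a_k):
    """Re{ a_l^H . C_lk . a_k } evaluated as a double sum over multipaths."""
    total = 0j
    for q in range(len(a_l)):
        for p in range(len(a_k)):
            total += a_l[q].conjugate() * c_lk[q][p] * a_k[p]
    return total.real


def r_element_dot(a_l, c_lk, a_k):
    """Same value as a real dot product of length 2*L*L using X = a_k . a_l^H."""
    acc = 0.0
    for q in range(len(a_l)):
        for p in range(len(a_k)):
            x = a_k[p] * a_l[q].conjugate()          # element X[p][q]
            acc += c_lk[q][p].real * x.real - c_lk[q][p].imag * x.imag
    return acc


def r_matrix_ops(k_v=200, paths=4):
    """Operation count 1.5 * 4 * (K_v * L)^2 quoted in the text."""
    return 6 * (k_v * paths) ** 2


if __name__ == "__main__":
    a_l = [1 + 2j, 0.5 - 1j]
    a_k = [2 - 1j, -1 + 0.5j]
    c = [[1 + 1j, 0.2 - 0.3j], [0.1 + 0.4j, -0.5 + 2j]]
    print(r_element_direct(a_l, c, a_k), r_element_dot(a_l, c, a_k), r_matrix_ops())
```

Because X depends only on the amplitudes, it can be formed once per physical-user pair and reused for every virtual user and offset, which is the amortization the text relies on.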



MDFIC 

From Equation (5) above the matched-filter outputs are given by 

$$y_l[m] = r_{ll}[0] \, b_l[m] + \sum_{k} r_{lk}[-1] \, b_k[m + 1] + \sum_{k} \left[ r_{lk}[0] - r_{ll}[0] \, \delta_{lk} \right] b_k[m] + \sum_{k} r_{lk}[1] \, b_k[m - 1] + \eta_l[m] \tag{9}$$

The first term represents the signal of interest. All the remaining terms represent Multiple Access 
Interference (MAI) and noise. The MDFIC algorithm iteratively solves for the symbol estimates $\hat{b}_l[m]$ 
using 

$$\hat{b}_l[m] = \mathrm{sign}\left\{ y_l[m] - \sum_{k} r_{lk}[-1] \, \hat{b}_k[m + 1] - \sum_{k} \left[ r_{lk}[0] - r_{ll}[0] \, \delta_{lk} \right] \hat{b}_k[m] - \sum_{k} r_{lk}[1] \, \hat{b}_k[m - 1] \right\} \tag{10}$$

with initial estimates given by hard decisions on the matched-filter detection statistics, $\hat{b}_l[m] = \mathrm{sign}\{y_l[m]\}$. 
The MDFIC [7] technique is closely related to the SIC and PIC techniques. Notice that new estimates $\hat{b}_l[m]$ 
are immediately introduced back into the interference cancellation as they are calculated. Hence at any 
given cancellation step the best available symbol estimates are used. This idea is analogous to the 
Gauss-Seidel method for solving diagonally dominant linear systems. 

The above iteration is performed on a block of 20 symbols, for all users. The 20-symbol block size 
represents two WCDMA time slots. The R-matrices are assumed to be constant over this period. 
Performance is improved under high input BER if the sign detector in Equation (10) is replaced by the 
hyperbolic tangent detector [6]. This detector has a single slope parameter which is variable from iteration 
to iteration. 



The three R-matrices (R[-1], R[0] and R[1]) are each $K_v \times K_v$ in size. The total number of operations then is 
$6K_v^2$ per iteration. The computational complexity of the MDFIC algorithm depends on the total number of 
virtual users, which depends on the mix of users at the various spreading factors. For $K_v$ = 200 users (e.g. 
100 low-rate users) this amounts to 240,000 operations. In the current implementation two iterations are 
used, requiring a total of 480,000 operations. For real-time operation these operations must be performed in 
1/15 ms. The total processing requirement is then 7.2 GOPS. Computational complexity is markedly 
reduced if a threshold parameter is set such that IC is performed only for values $|y_l[m]|$ below the threshold. 
The idea is that if $|y_l[m]|$ is large there is little doubt as to the sign of $b_l[m]$, and IC need not be performed. 
The value of the threshold parameter is variable from stage to stage. 
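
The iteration of Equation (10), including the optional threshold, can be sketched as below. This is a minimal illustration, not the AltiVec implementation: the function names are hypothetical, symbols outside the block are assumed to contribute no interference, a single threshold is used for all stages, and plain Python lists stand in for the R-matrices.

```python
def sign(x):
    return 1 if x >= 0 else -1


def mdfic(y, r_m1, r_0, r_p1, iterations=2, threshold=None):
    """Hard-decision MDFIC over a block. y[m][l] are matched-filter outputs;
    r_m1, r_0, r_p1 are the K_v x K_v matrices R[-1], R[0], R[1]."""
    n_sym, k_v = len(y), len(y[0])
    b = [[sign(v) for v in row] for row in y]    # initial hard decisions
    for _ in range(iterations):
        for m in range(n_sym):
            for l in range(k_v):
                # Skip IC when the statistic is large enough to be trusted
                if threshold is not None and abs(y[m][l]) >= threshold:
                    continue
                mai = 0.0
                for k in range(k_v):
                    if m + 1 < n_sym:
                        mai += r_m1[l][k] * b[m + 1][k]
                    if k != l:                   # exclude the signal of interest
                        mai += r_0[l][k] * b[m][k]
                    if m >= 1:
                        mai += r_p1[l][k] * b[m - 1][k]
                b[m][l] = sign(y[m][l] - mai)    # new estimate is used immediately
    return b


def mdfic_gops(k_v=200, iterations=2, symbol_rate_hz=15000):
    """Processing requirement from the 6 * K_v^2 operations-per-iteration count."""
    return 6 * k_v ** 2 * iterations * symbol_rate_hz / 1e9


if __name__ == "__main__":
    r0 = [[1.0, 1.5], [1.5, 4.0]]                # weak user 0, strong user 1
    zero = [[0.0, 0.0], [0.0, 0.0]]
    y = [[-0.5, -2.5], [-2.5, -5.5]]             # noiseless outputs for bits [[1,-1],[-1,-1]]
    print(mdfic(y, zero, r0, zero), mdfic_gops())
```

In the small near-far example, the matched filter alone gets the weak user's first bit wrong, and one cancellation pass corrects it; with a threshold of 1.0 only that one low-confidence statistic is reprocessed.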

Mapping to Hardware 

The above calculations are performed on a single 9"x6" card populated with four Power PC 7410 
processors. These processors employ the AltiVec SIMD vector arithmetic-logic unit, which has 32 128-bit 
vector registers. These registers can hold either 4 32-bit floats, 4 32-bit ints, 8 16-bit shorts, or 16 8-bit 
chars. Two vector SIMD operations (multiply and accumulate) can be performed per clock. The clock rate 
used for the current implementation is 400 MHz. The processors, however, can be operated at 500 MHz, 
with higher clock speeds expected in the near future. Each processor has 32KB of L1 cache and 2MB of 266MHz L2 
cache. The maximum theoretical performance of these processors is thus 3.2 GFLOPS, 6.4 GOPS (16-bit), 
or 12.8 GOPS (8-bit). The current implementation uses a combination of floating-point, 16-bit fixed-point 
and 8-bit fixed-point calculations. 

The four PPC7410 processors are interconnected with a RACE++ 266MB/s 8-port switched fabric as 
shown in Figure 2. The high bandwidth fabric allows transfer of large amounts of data with very low 
latency so as to achieve efficient parallelism of the four processors. The maximum theoretical performance 
of the card is thus 51.2 GOPS. 
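
The peak figures quoted above follow from the clock rate, the two vector operations per clock, and the number of lanes per 128-bit register. A short sketch of that arithmetic (names are illustrative):

```python
CLOCK_HZ = 400e6
VECTOR_OPS_PER_CLOCK = 2          # multiply and accumulate
LANES = {"float32": 4, "int16": 8, "int8": 16}
PROCESSORS = 4


def peak_gops(element_type):
    """Theoretical peak per processor for the given element width."""
    return CLOCK_HZ * VECTOR_OPS_PER_CLOCK * LANES[element_type] / 1e9


def card_peak_gops():
    """Theoretical peak of the four-processor card using 8-bit operations."""
    return PROCESSORS * peak_gops("int8")


if __name__ == "__main__":
    print({name: peak_gops(name) for name in LANES}, card_peak_gops())
```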
Figure 2. Partitioning of MUD functions across four processors 

As shown in Figure 2 the MDFIC and C-matrix calculations are allocated to a single processor. The other 
three processors are given to the R-matrix calculations which are considerably more complex. 

MUD BER Performance 



A sample of the Bit Error Rate (BER) performance of the MUD algorithm is shown in Figure 3. For 
comparison the matched-filter BER is also shown. The figure shows that MUD doubles system capacity. 
Figure 3. Log10 bit error rate versus system capacity for matched 
filter (blue) and multiuser detection (red) 



The above performance is based on the following assumptions: 

• A single receive antenna is used 

• The target BER is 0.001 
• The percentage of system users in handoff is 30% 

• Other-cell interference is 35% of intra-cell interference. This is lower than the typical value (0.60) 
used. The reason is that the other-cell users in handoff with the cell of interest are included in the 
intra-cell interference. This is because the cell of interest is processing these users and hence can cancel 
their interference using MUD. 

• A 4-tap multipath channel is used. Each tap is Rayleigh fading. The composite power of all paths is 
perfectly power controlled. 

• The channel amplitude estimation error is 10% 

• The channel delay estimation is l A chip 

• The activity factor for voice is 0.40 

• The relative amplitude of the control channel is $\beta_c$ = 0.5333 



Conclusions 

The current state of processor technology is such that iterative hard-decision MUD for the UMTS uplink 
can be implemented in software on a single card or daughter card populated with four Power PC 7410 
processors, connected together with a high-bandwidth RACE++ interconnect fabric. The use of short codes 
allows MUD to be performed at the symbol rate. The advantage of symbol-rate processing is that MUD can 
be introduced into a BTS as an enhancement to the conventional RAKE receiver. The MUD processing 
takes the MF detection statistics, performs interference cancellation, and then delivers improved hard or 
soft-decision symbol estimates to the symbol-rate BTS processing functions. The latency introduced is only 
a few milliseconds. In order to perform MUD at the symbol rate the R-matrices must be updated in real 
time. There is a minimal degradation in MUD efficiency if these elements are updated at a rate of once per 
1.33 ms. The R-matrices are used to cancel the multiple access interference through the MDFIC 
interference cancellation technique. At a BER of 0.001 the use of the above MUD technique doubles 
system capacity. 
1. Introduction 

This report briefly describes long-code Multi-User Detection (MUD). Section 2 describes 
the long-code signal model, which is different from the short-code model. Section 3 
describes the matched-filtering operation for long codes and gives a lower bound on the 
GOPS required for long-code symbol-rate MUD. The lower bound is 19.7 TOPS (i.e. Tera 
Operations Per Second; 1 TOPS = 1000 GOPS). Because of the extreme computational 
complexity of symbol-rate MUD for long codes regenerative MUD is examined. It is shown 
in Section 4 that although regenerative MUD operates at the chip rate, the overall 
complexity is lower for long codes. Two methods are examined. The first method is a 
somewhat straightforward implementation of regenerative MUD. The required 
computational complexity is shown to be 774.6 GOPS for 100 users. The second method 
is based on combining impulse trains and subsequently raised-cosine filtering the 
composite signal. The total computational complexity is shown to be 109.6 GOPS for 100 
users. Regenerative MUD is linear in the number of users, so that if the number of users 
is reduced to 64 the complexity drops to 70.1 GOPS. The complexity is also linear in the 
number of multipaths subtracted, so that if the number of multipaths subtracted is reduced 
from 4 to 2 the complexity drops to 35.1 GOPS. It may be desirable for MUD performance 
to subtract only the two largest multipaths due to channel amplitude estimation errors. The 
above complexity figures are for a single interference cancellation stage. For two stages 
the computation is doubled. To perform regenerative MUD the baseband antenna stream 
data must be brought onto the MUD board. The required bandwidth is 123 MB/s. Note that 
the figures given above can perhaps be reduced through a clever implementation. A block 
diagram of regenerative MUD is shown to facilitate an investigation into the feasibility of 
an FPGA or ASIC implementation. 



2. Signal Model 
The received signal model for short-code WCDMA is given in [1]. When long codes are 
used the signal model is different since effectively the codes change from symbol to 
symbol. We present here the WCDMA signal model for long codes. The baseband 
received signal can be written 

$$r[t] = \sum_{k=1}^{K_v} \sum_{m} \tilde{s}_{km}[t - mT_k] \, b_k[m] + w[t] \tag{1}$$

where t is the integer time sample index, $T_k = N_k N_c$ is the data bit duration, which depends 
on the user spreading factor, $N_k$ is the spreading factor for the kth virtual user, $N_c$ is the 
number of samples per chip, K is the total number of physical users, w[t] is receiver noise, 
and where $\tilde{s}_{km}[t]$ is the channel-corrupted signature waveform for the kth virtual user over 
the mth symbol period. The concept of virtual users is used to account for both the 
DPDCH and the DPCCH. Hence if there are K physical users, then there are $K_v = 2K$ 
virtual users. The user signature waveform and hence the channel-corrupted signature 
waveform vary from symbol period to symbol period since long codes by definition extend 
over many symbol periods. For L multipath components the channel-corrupted signature 
waveform for virtual user k is modeled as 

$$\tilde{s}_{km}[t] = \sum_{p=1}^{L} a_{kp} \, s_{km}[t - \tau_{kp}] \tag{2}$$

where $a_{kp}$ are the complex multipath amplitudes. The amplitude ratios $\beta_k$ are incorporated 
into the amplitudes $a_{kp}$. Notice that if k and l are virtual users corresponding to the 
DPCCH and the DPDCH of the same physical user then, aside from scaling by $\beta_k$ 
and $\beta_l$, $a_{kp}$ and $a_{lp}$ are equal. This is due to the fact that the signal waveforms for both the 
DPCCH and the DPDCH pass through the same channel. 

The waveform $s_{km}[t]$ is referred to as the signature waveform for the kth virtual user over 
the mth symbol period. This waveform is generated by passing the spreading code 
sequence $c_{km}[n]$ through a pulse-shaping filter g[t] 

$$s_{km}[t] = \sum_{r=0}^{N_k - 1} g[t - rN_c] \, c_k[r + mN_k] \tag{3}$$

where g[t] is the raised-cosine pulse shape. Since g[t] is a raised-cosine pulse as 
opposed to a root-raised-cosine pulse, the received signal r[t] represents the baseband 
signal after filtering by the matched chip filter. 



3. Matched filter 

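The received signal above, which has been match-filtered to the chip pulse, must next be 
match-filtered by the user code-sequence filter. The resulting detection statistic is 
denoted here as $y_k[m]$, the matched-filter output for the kth virtual user over the mth 
symbol period. Since there are $K_v$ codes, there are $K_v$ such detection statistics, which are 
collected into a column vector y[m]. The matched-filter output $y_l[m]$ for the lth virtual user 
can be written 

$$y_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l - 1} r[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \right\} \tag{4}$$

where $\hat{a}_{lq}$ is the estimate of $a_{lq}$, $\hat{\tau}_{lq}$ is the estimate of $\tau_{lq}$, and $\eta_l[m]$ is the match-filtered 
receiver noise. Substituting r[t] from Equation (1) above gives 

$$y_l[m] = \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l - 1} \left[ \sum_{k=1}^{K_v} \sum_{m'} \tilde{s}_{km'}[nN_c + \hat{\tau}_{lq} + mT_l - m'T_k] \, b_k[m'] + w[nN_c + \hat{\tau}_{lq} + mT_l] \right] c_{lm}^{*}[n] \right\}$$

$$= \mathrm{Re}\left\{ \sum_{m'} \sum_{k=1}^{K_v} \left[ \sum_{q=1}^{L} \sum_{p=1}^{L} \hat{a}_{lq}^{*} \, a_{kp} \, C_{lkqp}[m, m'] \right] b_k[m'] \right\} + \eta_l[m]$$

$$C_{lkqp}[m, m'] \equiv \frac{1}{2N_l} \sum_{n=0}^{N_l - 1} s_{km'}[nN_c + \tau_{lkqp}[m, m']] \cdot c_{lm}^{*}[n]$$

$$\eta_l[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l - 1} w[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \right\}$$

$$\tau_{lkqp}[m, m'] \equiv \hat{\tau}_{lq} - \tau_{kp} + mT_l - m'T_k \tag{5}$$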

In order to subtract interference we must, at a minimum, calculate $C_{lkqp}[m, m']$ for all virtual 
users and for all multipath components. A lower bound on the computational complexity 
can be determined by considering the above calculations for synchronous users. For 
synchronous users, all at the highest spreading factor, the required number of operations 
to calculate $C_{lkqp}[m, m']$ is $8(256)(2KL)^2$ = 1.31 Gops for K = 100 and L = 4. For real time 
operation 15000 such computations must be performed every second. This amounts to 
19.7 TOPS (i.e. Tera Operations Per Second). 
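
The lower bound can be reproduced directly from the expression above. The sketch below only restates that arithmetic; the function name is illustrative.

```python
def long_code_symbol_rate_bound(physical_users=100, paths=4, code_len=256,
                                symbol_rate_hz=15000):
    """Return (operations per symbol, TOPS) for 8 * N * (2*K*L)^2 at the symbol rate."""
    ops_per_symbol = 8 * code_len * (2 * physical_users * paths) ** 2
    return ops_per_symbol, ops_per_symbol * symbol_rate_hz / 1e12


if __name__ == "__main__":
    print(long_code_symbol_rate_bound())
```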



4. Regenerative MUD 
Because of the extreme computational complexity of symbol-rate MUD for long codes it is 
advantageous to resort to regenerative MUD when long codes are used. Although 
regenerative MUD operates at the chip rate, the overall complexity is lower for long 
codes. For regenerative MUD the signal waveforms of interferers are regenerated at the 
sample rate and effectively subtracted from the received signal. A second pass through 
the matched filter then yields improved performance. It turns out that the computational 
complexity of regenerative MUD is linear in the number of users. 

The received signal can be written 

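$$r[t] = \sum_{k=1}^{2K} r_k[t] + w[t] \tag{6}$$

$$r_k[t] \equiv \sum_{m} \sum_{p=1}^{L} a_{kp} \, s_{km}[t - \tau_{kp} - mT_k] \, b_k[m]$$

Subtracting interference gives a cleaned-up signal $x_l[t]$ 

$$x_l[t] \equiv r[t] - \sum_{k=1,\, k \neq l}^{2K} \hat{r}_k[t]$$

$$= r[t] - \sum_{k=1}^{2K} \hat{r}_k[t] + \hat{r}_l[t]$$

$$= r[t] - \hat{r}[t] + \hat{r}_l[t]$$

$$= \hat{r}_l[t] + r_{res}[t] \tag{7}$$

$$\hat{r}[t] \equiv \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{m} \hat{a}_{kp} \, s_{km}[t - \hat{\tau}_{kp} - mT_k] \, \hat{b}_k[m], \qquad r_{res}[t] \equiv r[t] - \hat{r}[t]$$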
Two methods are presented below for performing regenerative MUD. 
First Method 

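In order to subtract interference we must reconstruct (regenerate) the waveform $s_{km}[t]$ as 
given in Equation (3). The waveform can be reconstructed using 

$$s_{km}[t] = \sum_{r=0}^{N_k - 1} g[t - rN_c] \, c_{km}[r]$$

$$= \sum_{p=0}^{N_k/4 - 1} \sum_{j=0}^{3} g[t - (4p + j)N_c] \, c_{km}[4p + j]$$

$$= \sum_{p=0}^{N_k/4 - 1} s_{kmp}[t - 4pN_c], \qquad s_{kmp}[t] \equiv \sum_{j=0}^{3} g[t - jN_c] \, c_{kmp}[j]$$

$$c_{kmp}[j] \equiv c_{km}[4p + j]$$

with $c_{km}[r] \equiv c_k[r + mN_k]$. 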

The idea is that $s_{km}[t]$ can be represented as a summation of shifted waveforms $s_{kmp}[t]$, 
which are entirely specified by the 8 binary numbers comprising the complex sequence 
$c_{kmp}[j]$ of length 4. Hence there are only $2^8$ = 256 such waveforms. For what follows we 
assume that the signals are sampled at $N_c$ = 8 samples per chip. Each is of length 96 + 
3(4) = 108 samples assuming that g[t] is of length 96. For 2 bytes per sample (real and 
imaginary parts) the total memory requirement is 216*256 = 55296 bytes, which spills out 
of L1 cache, but fits entirely in L2 cache. 

To generate $r_k[t]$ for a single symbol period, 64 of these waveforms must be read from 
memory. For each of these 64 waveforms L complex macs are required per sample per 
symbol period. Hence 64(8L)(108) operations are required per symbol period. For L = 4 
this amounts to 64(32)(108) = 221184 operations per symbol period (1/15 ms), or 3.32 
GOPS. The formation of $r_{res}[t]$ then requires 2K times this, or 3.32(200) = 664 GOPS for K 
= 100 physical users. To form $\hat{r}_l[t] + r_{res}[t]$ requires an additional 2(96+255*4) = 2232 
operations per symbol period per virtual user, or another 6.7 GOPS. Finally, the matched 
filter operation needs to be performed for each user, which from Equation (4) requires 
NLK complex macs (N = 256), or 256(4)(100)(8)*15000 = 12.3 GOPS. The GOPS figures 
above are for a single antenna. For two antennas the operations are doubled. Hence the 
total computational complexity is 2(664 + 6.7 + 12.3) = 1.37 TOPS. This is for a 
single-stage MPIC algorithm. For two stages the computation is doubled. 
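
The first-method budget can be re-added from the quantities quoted above. This sketch takes the waveform length of 108 samples and the other counts as given in the text; the function names are illustrative.

```python
def waveform_table_bytes(n_waveforms=256, samples=108, bytes_per_sample=2):
    """Memory for the table of precomputed 4-chip waveforms."""
    return n_waveforms * samples * bytes_per_sample


def first_method_gops(physical_users=100, paths=4, samples=108, reads_per_symbol=64,
                      symbol_rate_hz=15000, antennas=2):
    """Return (ops to regenerate one virtual user per symbol, total GOPS)."""
    virtual = 2 * physical_users
    regen_ops = reads_per_symbol * (8 * paths) * samples       # 8 real ops per complex mac
    regen = regen_ops * virtual * symbol_rate_hz / 1e9
    combine = 2 * (96 + 255 * 4) * virtual * symbol_rate_hz / 1e9
    matched_filter = 256 * paths * physical_users * 8 * symbol_rate_hz / 1e9
    return regen_ops, antennas * (regen + combine + matched_filter)


if __name__ == "__main__":
    print(waveform_table_bytes(), first_method_gops())
```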

To perform regenerative MUD the baseband antenna stream data must be brought onto 
the MUD board. The required bandwidth is 

[2 Bytes(complex)/Sa/Ant][2 Ant][8 Sa/chip][3.84 Mchips/ second] = 123 MB/s 



Second Method 

The second method is to represent the waveform for each multipath for each user as a 
complex impulse train with N c = 8 samples per impulse. The complex amplitude of each 
impulse is the product of the complex chip, complex multipath amplitude and the binary 
(real) data bit estimate. These 2KL complex streams (times 2 for 2 antennas) are added to 
form a composite signal. Since this composite signal is a sum of many impulse trains, all 
asynchronous, the composite signal is a dense (i.e. no systematic zeros) signal at the 
sample rate. A block diagram of the processing is shown in Figure 1. 

Figure 1. A block diagram of the long-code MUD processing 
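From Equations (7) and (8) 

$$\hat{r}[t] = \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{m} \hat{a}_{kp} \, s_{km}[t - \hat{\tau}_{kp} - mT_k] \, \hat{b}_k[m]$$

$$= \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{m} \sum_{r=0}^{N_k - 1} g[t - \hat{\tau}_{kp} - mT_k - rN_c] \, c_k[r + mN_k] \, \hat{b}_k[m]$$

$$= \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{m} \sum_{r=0}^{N_k - 1} g[t - \hat{\tau}_{kp} - (r + mN_k)N_c] \, c_k[r + mN_k] \, \hat{b}_k[\lfloor (r + mN_k)/N_k \rfloor]$$

$$= \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{n} g[t - \hat{\tau}_{kp} - nN_c] \, c_k[n] \, \hat{b}_k[\lfloor n/N_k \rfloor]$$

$$= \sum_{k=1}^{2K} \sum_{p=1}^{L} \hat{a}_{kp} \sum_{r} \sum_{n} g[r] \, \delta[t - r - \hat{\tau}_{kp} - nN_c] \, c_k[n] \, \hat{b}_k[\lfloor n/N_k \rfloor]$$

$$= \sum_{r} g[r] \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{n} \delta[t - r - \hat{\tau}_{kp} - nN_c] \cdot \hat{a}_{kp} \cdot c_k[n] \cdot \hat{b}_k[\lfloor n/N_k \rfloor]$$

$$= \sum_{r} g[r] \, cc[t - r]$$

$$cc[t] \equiv \sum_{k=1}^{2K} \sum_{p=1}^{L} \sum_{n} \delta[t - \hat{\tau}_{kp} - nN_c] \cdot \hat{a}_{kp} \cdot c_k[n] \cdot \hat{b}_k[\lfloor n/N_k \rfloor] \tag{9}$$

where cc[t] is the composite signal. For each symbol period this requires 256(10)(2KL) 
operations per antenna. For two antennas this amounts to 5120(200)(4) = 4096000 
operations per symbol period, or 61.4 GOPS. 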
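
The equivalence used above (filtering one composite impulse train gives the same result as regenerating and summing each path separately) follows from the linearity of the pulse-shaping filter. The sketch below checks this on a toy example; the tuple layout for a user, the function names, and the short stand-in pulse are all illustrative assumptions.

```python
def composite_signal(users, n_samples, n_c):
    """Sum of impulse trains cc[t]. Each user is (code, bits, n_k, paths) with
    paths = [(complex amplitude, integer delay in samples), ...]."""
    cc = [0j] * n_samples
    for code, bits, n_k, paths in users:
        for amp, delay in paths:
            for n, chip in enumerate(code):
                t = delay + n * n_c
                if t < n_samples:
                    cc[t] += amp * chip * bits[n // n_k]
    return cc


def fir(g, x):
    """Causal FIR filter: out[t] = sum_r g[r] * x[t - r]."""
    out = [0j] * len(x)
    for t in range(len(x)):
        for r, tap in enumerate(g):
            if r <= t:
                out[t] += tap * x[t - r]
    return out


def regenerate_per_path(users, g, n_samples, n_c):
    """Reference: filter every path of every user separately, then sum."""
    total = [0j] * n_samples
    for code, bits, n_k, paths in users:
        for path in paths:
            one = fir(g, composite_signal([(code, bits, n_k, [path])], n_samples, n_c))
            total = [a + b for a, b in zip(total, one)]
    return total


if __name__ == "__main__":
    users = [([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], [1, -1], 2, [(0.8 + 0.1j, 0), (0.3 - 0.2j, 3)]),
             ([1 - 1j, -1 - 1j, 1 + 1j, -1 + 1j], [-1, -1], 2, [(0.5j, 1)])]
    g = [0.25, 0.5, 1.0, 0.5, 0.25]
    fast = fir(g, composite_signal(users, 20, 2))
    slow = regenerate_per_path(users, g, 20, 2)
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The composite route applies the filter once per antenna stream regardless of the number of users and paths, which is where the saving over the first method comes from.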
An estimate of the received signal is then determined by passing the composite signal 
through the raised-cosine filter g[t] of length 96, which requires 96 real macs, or 192 real 
operations, per sample per real stream. There are a total of 4 real streams (2 antennas, 
real and imaginary streams). The total GOPS then for $N_c$ = 8 samples per chip is 
192(4)(8)(3.84M) = 23.6 GOPS. 

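The final step is to pass the cleaned-up signal $x_l[t] = \hat{r}_l[t] + r_{res}[t]$ through the matched filter 
(i.e. rake receiver), which gives the improved detection statistic 

$$y_l^{(1)}[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l - 1} x_l[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \right\}$$

$$= \mathrm{Re}\left\{ \sum_{q=1}^{L} \sum_{q'=1}^{L} \sum_{m'} \hat{a}_{lq}^{*} \, \hat{a}_{lq'} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l - 1} s_{lm'}[nN_c + \hat{\tau}_{lq} - \hat{\tau}_{lq'} + (m - m')T_l] \, \hat{b}_l[m'] \cdot c_{lm}^{*}[n] \right\} + y_{res,l}[m]$$

$$\approx \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \, \hat{a}_{lq} \right\} \hat{b}_l[m] + y_{res,l}[m]$$

$$= A_l^2 \, \hat{b}_l[m] + y_{res,l}[m]$$

$$y_{res,l}[m] \equiv \mathrm{Re}\left\{ \sum_{q=1}^{L} \hat{a}_{lq}^{*} \cdot \frac{1}{2N_l} \sum_{n=0}^{N_l - 1} r_{res}[nN_c + \hat{\tau}_{lq} + mT_l] \cdot c_{lm}^{*}[n] \right\} \tag{10}$$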



The matched filter operation requires NLK complex macs, or 256(4)(100)(8)*15000 = 12.3 
GOPS. The GOPS figures above are for a single antenna. For two antennas the 
operations are doubled, giving 24.6 GOPS. The total computational complexity for the 
second method is then 61.4 + 23.6 + 24.6 = 109.6 GOPS. 
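
The three contributions to the second-method total can be re-added as follows. The sketch restates the arithmetic in the text with the quoted parameters as defaults; the function name is illustrative.

```python
def second_method_gops(physical_users=100, paths=4, antennas=2, samples_per_chip=8,
                       chip_rate_hz=3.84e6, symbol_rate_hz=15000):
    """Return (impulse-train GOPS, raised-cosine filter GOPS, matched-filter GOPS, total)."""
    impulse = 256 * 10 * (2 * physical_users * paths) * antennas * symbol_rate_hz / 1e9
    real_streams = 2 * antennas                       # real and imaginary per antenna
    rc_filter = 192 * real_streams * samples_per_chip * chip_rate_hz / 1e9
    matched = 256 * paths * physical_users * 8 * symbol_rate_hz * antennas / 1e9
    return impulse, rc_filter, matched, impulse + rc_filter + matched


if __name__ == "__main__":
    print(second_method_gops())
```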
1. Introduction 



This report investigates a number of different methods for calculating the R- 
matrix elements. There are two parts to the calculation. First is the calculation of 
the user code correlations at lag offsets determined by the searcher receivers. 
This calculation must be performed every time a multipath component changes 
to a new lag. The assumption used here is that every 100 ms one multipath 
component changes to a new lag for each user. Hence, if each user has 4 
multipath lags, then all R-matrix elements will have changed after 400 ms. The 
validity of this assumption will have to be tested with measured data. Note that 
the WCDMA standard calls out a test with 2 multipath components, where one lag 
changes every 191 ms [1]. The second part is the actual calculation of the 
R-matrix elements, which requires a double summation of code correlations over all 
multipath components, with each term scaled by the Rayleigh-fading multipath 
amplitudes. The maximum time period to perform this calculation is about 1.33 
ms. Hence there are two parts to the calculation, each with a different update 
rate. 

Section 2 is devoted to the first part of the calculation, the code correlations. 
Section 3 covers the actual calculation of the R-matrix elements. 






2. Calculation of User Code Correlations 

The R-matrix elements can be expressed as [2] 



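$$C_{lkqq'}[m'] \equiv \frac{1}{2N_l} \sum_{n} s_k[nN_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \cdot c_l^{*}[n] \tag{1}$$

$$= \frac{1}{2N_l} \sum_{n} \sum_{p} g[(n - p)N_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \, c_k[p] \cdot c_l^{*}[n]$$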

where $C_{lkqq'}[m']$ is a five-dimensional matrix of code correlations. Both $l$ 
and $k$ range from 1 to $K_v$, where $K_v$ is the number of virtual users. If 
there are $K$ physical users, all operating at the highest spreading factor, then 
there are $K_v = 2K$ virtual users. For now consider $K = 128$ so that 
$K_v = 256$. The indices $q$ and $q'$ range from 1 to $L$, the number of 
multipath components, which for this report is assumed to be equal to 4. The 
symbol period offset $m'$ ranges from -1 to 1. The total number of matrix 
elements to be calculated is then $n_c = 3(K_vL)^2 = 3(1024)^2 = 3\text{M}$ 
complex elements, or 24 MB if each element is a float. This number is reduced, 
however, due to the symmetries 

$$C_{klq'q}[-m'] = \frac{1}{2N_k}\sum_n\sum_p g[-(n-p)N_c - m'T - \hat\tau_{lq} + \hat\tau_{kq'}]\, c_l[n]\, c_k^*[p]$$

$$= \frac{1}{2N_k}\sum_n\sum_p g[(n-p)N_c + m'T + \hat\tau_{lq} - \hat\tau_{kq'}]\, c_k^*[p]\, c_l[n] = \frac{N_l}{N_k}\, C^*_{lkqq'}[m'] \tag{2}$$

so that it is sufficient to store elements for offsets m' = 0,1. The memory 
requirement is then 16 MB if each element is a float. If the elements are stored 
as bytes the requirement is reduced to 4 MB. 
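
The storage figures above can be checked with a short sketch. It assumes (this is not stated in the report) that a "float" element means a complex value of 8 bytes, that a "byte" element means 2 bytes per complex value, and that 1 MB is 2^20 bytes; the function name is illustrative only.

```python
# Storage needed for the code-correlation matrix C[l][k][q][q'][m'].
# Assumptions: 8 bytes per complex float element, 2 bytes per complex
# byte element, and 1 MB = 2**20 bytes.
MB = 2 ** 20

def c_matrix_storage(k_virtual=256, n_paths=4, n_offsets=3, bytes_per_element=8):
    """Return (number of complex elements, size in bytes)."""
    elements = n_offsets * (k_virtual * n_paths) ** 2
    return elements, elements * bytes_per_element

full_elements, full_bytes = c_matrix_storage()                      # m' = -1, 0, 1
_, symmetric_bytes = c_matrix_storage(n_offsets=2)                  # m' = 0, 1 only
_, symmetric_byte_storage = c_matrix_storage(n_offsets=2, bytes_per_element=2)

print(full_elements // MB, full_bytes // MB,
      symmetric_bytes // MB, symmetric_byte_storage // MB)
```

Under these assumptions the sketch gives the 3M elements, 24 MB, 16 MB and 4 MB quoted in the text.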



Referring to Equation 1, line 2, it is evident that each element of 
$C_{lkqq'}[m']$ is a complex dot product between a code vector $c_l$ and a 
waveform vector $s_{kqq'}$. The length of the code vector is 256. The length of 
the waveform vector is $L_g + 255N_c$, where $L_g$ is the length of the 
raised-cosine pulse vector $g[t]$ and $N_c$ is the number of samples per chip. 
The values for these parameters as currently implemented are $L_g = 48$ and 
$N_c = 4$. The length of the waveform vector is then 1068, but for the dot 
product it is accessed at a stride of $N_c = 4$, which gives effectively a 
length of 267. Note that the code and waveform vectors in general do not entirely 
overlap. Also note that an increment or decrement in the symbol offset index $m'$ 
slides the waveform vector 256 elements to the left or right respectively. 
Figure 1 shows that the total number of complex macs (cmacs) for all three 
($m'$ = -1, 0, 1) dot products is 267, irrespective of any relative offset. 
























Figure 1. Overlap of waveform and code vectors. The total 
number of complex macs (cmacs) for all three (m' =-1,0, 1) 
dot products is 267, irrespective of any relative offset. 

Hence for any given combination of indices $lkqq'$, the three elements 
$C_{lkqq'}[m']$ corresponding to $m'$ = -1, 0 and 1 together require 267 cmacs. 
Since there are $(K_vL)^2$ combinations of indices, the calculation of all 
elements $C_{lkqq'}[m']$ requires $(K_vL)^2(267)$ cmacs. Given the symmetry 
condition, only half of the elements need to be calculated, and noting that each 
cmac requires 8 operations to perform, the total number of operations required is 

$$N_{ops} = \tfrac{1}{2}(K_vL)^2(267)(8) = \tfrac{1}{2}(1024)^2(267)(8) = 1.12\ \text{Gops} \tag{3}$$
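
As a quick numerical check of the operation count, and of the rate it implies when the full set of correlations is refreshed once per 400 ms, the following sketch may help. The function and parameter names are illustrative, not taken from the report.

```python
def correlation_update_load(k_virtual=256, n_paths=4, cmacs_per_triplet=267,
                            ops_per_cmac=8, update_period_s=0.400):
    """Return (operations per full update, GOPS at the given update period)."""
    # Half of the (Kv*L)^2 index combinations are needed, by symmetry.
    n_ops = 0.5 * (k_virtual * n_paths) ** 2 * cmacs_per_triplet * ops_per_cmac
    return n_ops, n_ops / update_period_s / 1e9

n_ops, gops = correlation_update_load()
print(round(n_ops / 1e9, 2), round(gops, 1))
```

With the default values this evaluates to about 1.12 giga-operations per update and about 2.8 GOPS.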



The total number of GOPS (Giga Operations Per Second), then, given the 400 
ms update rate is 

$$\text{GOPS} = \frac{N_{ops}}{400\ \text{ms}} = \frac{1.12\ \text{Gops}}{400\ \text{ms}} = 2.8\ \text{GOPS} \tag{4}$$

The next section addresses the calculation of the R-matrix elements. 

3. Calculation of R-matrix Elements 

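Consider the calculation of the R-matrix elements 

$$\rho_{lk}[m']\,A_lA_k = \sum_{q=1}^{L}\sum_{q'=1}^{L}\mathrm{Re}\{\hat a_{lq}^*\,\hat a_{kq'}\,C_{lkqq'}[m']\} \tag{5}$$

The total number of matrix elements to be calculated is $N_\rho = 3K_v^2$. This 
number is reduced, however, due to the symmetries 

$$\rho_{kl}[-m']\,A_kA_l = \sum_{q=1}^{L}\sum_{q'=1}^{L}\mathrm{Re}\{\hat a_{kq'}^*\,\hat a_{lq}\,C_{klq'q}[-m']\} = \frac{N_l}{N_k}\sum_{q=1}^{L}\sum_{q'=1}^{L}\mathrm{Re}\{\hat a_{lq}^*\,\hat a_{kq'}\,C_{lkqq'}[m']\} = \frac{N_l}{N_k}\,\rho_{lk}[m']\,A_lA_k \tag{6}$$

so that the total number of matrix elements to be calculated is 
$N_\rho = \tfrac{3}{2}K_v^2$. 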

Now let us consider the operations per element. Dropping explicit reference to 
the symbol period offset $[m']$, the matrix elements are 

$$\rho_{lk}\,A_lA_k = \sum_{q=1}^{L}\sum_{q'=1}^{L}\mathrm{Re}\{\hat a_{lq}^*\,\hat a_{kq'}\,C_{lkqq'}\} \tag{7}$$

A brute-force calculation requires $L^2(6 + 3 + 1)$ operations (1 complex 
multiply, one half-complex multiply, i.e. the real part only, and one real add, 
or 6 real multiplies and 4 real adds). The total operations is then 

$$N_{ops} = \tfrac{3}{2}(K_vL)^2(10) \tag{8}$$



For a vehicular speed of 120 km/h the Doppler frequency is 216.67 Hz for a user 
at frequency 1950 MHz. The Doppler spread is thus 433.33 Hz, and the 
corresponding coherence time is about 2.3 ms. Hence the multipath amplitudes 
are changing with a time constant of about 2 ms, and consequently the second 
part of the calculation must be updated at least every 2 ms. The channel 
amplitudes are calculated on a time slot by time slot basis. Each time slot is 
10/15 = 2/3 = 0.67 ms. Hence 2 ms equals 3 time slots, whereas two slots equals 
1.33 ms. Figures 2 and 3 below show the MUD efficiency versus user velocity for 
2 ms and 1.33 ms update times respectively. The plots show that to be able to 
effectively handle high velocity users the update time should be 1.33 ms. When 
users are at various speeds the interference from low speed users is cancelled 
more effectively than the interference from high speed users. The MUD efficiency 
will then be an average of the MUD efficiency corresponding to each user's 
speed. 
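
The Doppler and slot-timing figures in this paragraph can be reproduced with the sketch below. It assumes a rounded speed of light of 3e8 m/s and takes the coherence time as the reciprocal of twice the Doppler frequency, a rule of thumb that reproduces the quoted 2.3 ms; other definitions of coherence time would give different values.

```python
SPEED_OF_LIGHT = 3.0e8  # m/s, rounded; reproduces the quoted 216.67 Hz

def doppler_hz(speed_kmph, carrier_hz):
    """Maximum Doppler shift for a given speed and carrier frequency."""
    return (speed_kmph / 3.6) * carrier_hz / SPEED_OF_LIGHT

def coherence_time_ms(speed_kmph, carrier_hz):
    """Assumed rule of thumb: coherence time = 1 / (2 * Doppler frequency)."""
    return 1e3 / (2.0 * doppler_hz(speed_kmph, carrier_hz))

SLOT_MS = 10.0 / 15.0  # the 10/15 ms time slot used in the text

f_d = doppler_hz(120, 1950e6)
t_c = coherence_time_ms(120, 1950e6)
print(round(f_d, 2), round(t_c, 2), round(2 * SLOT_MS, 2), round(3 * SLOT_MS, 2))
```

Two slots (about 1.33 ms) fit inside the assumed coherence time, whereas three slots (2 ms) use up most of it.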



[Figure 2 plot. Panel title: "Calculations updated every 2 ms (3 time slots)". 
Horizontal axis: User Velocity (kmph).] 

Figure 2. MUD efficiency versus user velocity for a 
2 ms R-matrix update time. 



[Figure 3 plot. Horizontal axis: User Velocity (kmph).] 

Figure 3. MUD efficiency versus user velocity for a 
1.33 ms R-matrix update time. 



The calculations below are based on a 1.33 ms update time. Note that most of 
the capacity and coverage benefits calculated for MUD so far have assumed 
70% MUD efficiency. The 1.33 ms update time is sufficient to achieve 70% MUD 
efficiency. The total GOPS are then 

$$\text{GOPS} = \frac{\tfrac{3}{2}(K_vL)^2(10)}{1.33\ \text{ms}} = \frac{1.5(256\cdot 4)^2(10)}{1.33\ \text{ms}} = 11.8\ \text{GOPS} \tag{9}$$

where we have assumed $L = 4$ multipath components. A better way to perform 
this operation is 

$$\rho_{lk}\,A_lA_k = \sum_{q=1}^{L}\mathrm{Re}\Big\{\hat a_{lq}^*\Big[\sum_{q'=1}^{L} C_{lkqq'}\,\hat a_{kq'}\Big]\Big\} \tag{10}$$



The inner sum is a matrix-vector multiply, hence requiring $L^2$ cmacs, and the 
outer sum is the real part of a complex dot product, which requires $L$ 
half-cmacs. The total is then $(L^2 + L/2) = 1.125L^2$ cmacs (for $L = 4$) times 
8 operations per cmac, or $9L^2$ operations, which gives 

$$\text{GOPS} = \frac{\tfrac{3}{2}(K_vL)^2(9)}{1.33\ \text{ms}} = \frac{1.5(256\cdot 4)^2(9)}{1.33\ \text{ms}} = 10.6\ \text{GOPS} \tag{11}$$
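
The two operation counts can be compared numerically. The sketch below expresses each method as a number of operations per (q, q') term and converts it to GOPS for a 1.33 ms update; the function name and defaults are illustrative and simply restate the values assumed in the text (256 virtual users, 4 multipath components).

```python
def r_matrix_gops(ops_per_term, k_virtual=256, n_paths=4, update_s=1.33e-3):
    """GOPS for 1.5 * (Kv*L)^2 * ops_per_term operations every update_s seconds."""
    return 1.5 * (k_virtual * n_paths) ** 2 * ops_per_term / update_s / 1e9

L = 4
brute_force_ops = 6 + 3 + 1                   # operations per (q, q') term
matvec_ops = (L ** 2 + L / 2) * 8 / L ** 2    # (L^2 + L/2) cmacs, 8 ops each

print(brute_force_ops, matvec_ops)
print(round(r_matrix_gops(brute_force_ops), 1), round(r_matrix_gops(matvec_ops), 1))
```

For L = 4 the matrix-vector ordering costs 9 operations per term instead of 10, a saving of 10 percent.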



The above calculations are represented in terms of complex numbers, which are 
not directly calculable. To express the above equations explicitly in terms of real 
numbers it is convenient to cast the calculations into matrix form 



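$$\rho_{lk}\,A_lA_k = \sum_{q=1}^{L}\sum_{q'=1}^{L}\mathrm{Re}\{\hat a_{lq}^*\,C_{lkqq'}\,\hat a_{kq'}\} = \mathrm{Re}\left\{\begin{bmatrix}\hat a_{l1}^* & \hat a_{l2}^* & \cdots & \hat a_{lL}^*\end{bmatrix}\begin{bmatrix}C_{lk11} & C_{lk12} & \cdots & C_{lk1L}\\ C_{lk21} & C_{lk22} & \cdots & C_{lk2L}\\ \vdots & \vdots & & \vdots\\ C_{lkL1} & C_{lkL2} & \cdots & C_{lkLL}\end{bmatrix}\begin{bmatrix}\hat a_{k1}\\ \hat a_{k2}\\ \vdots\\ \hat a_{kL}\end{bmatrix}\right\} = \mathrm{Re}\{\hat a_l^H\,C_{lk}\,\hat a_k\} \tag{12}$$

$$\hat a_k \equiv \begin{bmatrix}\hat a_{k1}\\ \hat a_{k2}\\ \vdots\\ \hat a_{kL}\end{bmatrix}, \qquad C_{lk} \equiv \begin{bmatrix}C_{lk11} & C_{lk12} & \cdots & C_{lk1L}\\ C_{lk21} & C_{lk22} & \cdots & C_{lk2L}\\ \vdots & \vdots & & \vdots\\ C_{lkL1} & C_{lkL2} & \cdots & C_{lkLL}\end{bmatrix}$$
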
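The quadratic form $\hat a_l^H C_{lk}\hat a_k$ can be expressed, writing 
$a = \hat a_l$, $b = \hat a_k$ and $C = C_{lk}$ in terms of their real and 
imaginary parts, as 

$$\mathrm{Re}\{\hat a_l^H\,C_{lk}\,\hat a_k\} = \mathrm{Re}\{[a_r^T - ja_i^T]\,[C_r + jC_i]\,[b_r + jb_i]\}$$

$$= \mathrm{Re}\{[a_r^T - ja_i^T]\,[C_rb_r - C_ib_i + j(C_rb_i + C_ib_r)]\}$$

$$= \mathrm{Re}\{a_r^TC_rb_r - a_r^TC_ib_i + a_i^TC_rb_i + a_i^TC_ib_r + j(a_r^TC_rb_i + a_r^TC_ib_r - a_i^TC_rb_r + a_i^TC_ib_i)\}$$

$$= \begin{bmatrix}a_r^T & a_i^T\end{bmatrix}\begin{bmatrix}C_r & -C_i\\ C_i & C_r\end{bmatrix}\begin{bmatrix}b_r\\ b_i\end{bmatrix} \tag{13}$$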
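
The real-valued block form can be checked against the complex quadratic form with a small sketch. The vectors and matrix below are arbitrary test values, not data from the report, and the function names are illustrative.

```python
def complex_form(a, C, b):
    """Re{a^H C b} evaluated directly with complex arithmetic."""
    n = len(a)
    return sum(a[i].conjugate() * C[i][j] * b[j]
               for i in range(n) for j in range(n)).real

def real_block_form(a, C, b):
    """[a_r^T a_i^T] [[C_r, -C_i], [C_i, C_r]] [b_r; b_i] using real arithmetic only."""
    n = len(a)
    # Upper and lower halves of the 2L-element matrix-vector product.
    top = [sum(C[i][j].real * b[j].real - C[i][j].imag * b[j].imag for j in range(n))
           for i in range(n)]
    bottom = [sum(C[i][j].imag * b[j].real + C[i][j].real * b[j].imag for j in range(n))
              for i in range(n)]
    # Final dot product of length 2L.
    return sum(a[i].real * top[i] + a[i].imag * bottom[i] for i in range(n))

a = [1 + 2j, -0.5 + 1j]
b = [0.3 - 1j, 2 + 0.5j]
C = [[1 + 1j, 2 - 1j], [0.5j, -1 + 0j]]
print(complex_form(a, C, b), real_block_form(a, C, b))
```

The real form uses a matrix-vector product of size 2L by 2L followed by a dot product of length 2L, which is where the mac counts in the next paragraph come from.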

The matrix-vector multiplication requires $(2L)^2$ macs. The dot product adds 
$2L$ macs so that the total is $(2L)^2 + 2L$ macs. For $L = 4$ we have 
$1.125(2L)^2$ macs = $4.5L^2$ macs = $9L^2$ operations. The total GOPS are then 

$$\text{GOPS} = \frac{\tfrac{3}{2}(K_vL)^2(9)}{1.33\ \text{ms}} = 10.6\ \text{GOPS} \tag{14}$$

Now consider a different formulation which attempts to reuse the 
amplitude-amplitude multiplications. Consider the calculation $a^TCb$: 

$$a^TCb = \mathrm{tr}[a^TCb] = \mathrm{tr}[C\,(ba^T)] = \mathrm{tr}[CX], \qquad X \equiv ba^T \tag{15}$$

The calculations to produce matrix $X$ are pure multiplications, but the 
elements, once calculated, can be reused for the other virtual users 
corresponding to the same physical users. For voice-only users there are 2 
virtual users per physical user. For data users there can be up to 65 virtual 
users per physical user. For now, however, we stay with our 128 voice-user 
scenario. To calculate $X$, then, requires $(2L)^2 = 4L^2$ multiplications. This 
calculation is performed once per pair of physical users, so the total number of 
operations is 

$$N_{ops} = (KL)^2(4) = (K_vL)^2(1) = \tfrac{3}{2}(K_vL)^2(\tfrac{2}{3}) \tag{16}$$

Effectively, then, $X$ requires $(2/3)L^2$ operations. The details to calculate 
$a^TCb$ are 

$$a^TCb = \mathrm{tr}[CX] = \mathrm{tr}\left\{\begin{bmatrix}c_1^T\\ \vdots\\ c_{2L}^T\end{bmatrix}\begin{bmatrix}x_1 & \cdots & x_{2L}\end{bmatrix}\right\} = \sum_{i=1}^{2L} c_i^Tx_i \tag{17}$$

where $c_i^T$ is the $i$th row of $C$ and $x_i$ is the $i$th column of $X$. Hence 
we have $2L$ dot products of length $2L$, which require $(2L)^2$ macs = $8L^2$ 
operations. To calculate $a^TCb$ then requires $8L^2 + (2/3)L^2 = 8.67L^2$ 
operations, which gives 

$$\text{GOPS} = \frac{\tfrac{3}{2}(K_vL)^2(8.67)}{1.33\ \text{ms}} = \frac{1.5(256\cdot 4)^2(8.67)}{1.33\ \text{ms}} = 10.3\ \text{GOPS}$$

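A better way to perform this calculation is as follows 

$$\rho = \sum_{q=1}^{L}\sum_{q'=1}^{L}\mathrm{Re}\{a_q^*\,a_{q'}\,c_{qq'}\} = \sum_{q=1}^{L}\sum_{q'=1}^{L}\mathrm{Re}\{(x^r_{qq'} - jx^i_{qq'})\cdot(c^r_{qq'} + jc^i_{qq'})\} = \sum_{q=1}^{L}\sum_{q'=1}^{L}\left(x^r_{qq'}c^r_{qq'} + x^i_{qq'}c^i_{qq'}\right) \tag{18}$$

$$x_{qq'} \equiv a_q\,a^*_{q'} = (a^r_q + ja^i_q)\cdot(a^r_{q'} - ja^i_{q'}) = x^r_{qq'} + jx^i_{qq'}$$

$$x^r_{qq'} = a^r_qa^r_{q'} + a^i_qa^i_{q'}, \qquad x^i_{qq'} = a^i_qa^r_{q'} - a^r_qa^i_{q'} \tag{19}$$

where for convenience we have dropped $A_lA_k$, the $lk$ subscripts and the hat 
symbols. The calculation of $X$ requires 

$$N_{ops} = (KL)^2(6) = (K_vL)^2(6/4) = \tfrac{3}{2}(K_vL)^2(1) \tag{20}$$

operations. Note that, once the $X$ values are calculated, the remainder of the 
calculation is a long dot product of length $2L^2$, hence requiring $2L^2$ macs, 
or $4L^2$ operations, which gives 

$$\text{GOPS} = \frac{\tfrac{3}{2}(K_vL)^2(5)}{1.33\ \text{ms}} = \frac{1.5(256\cdot 4)^2(5)}{1.33\ \text{ms}} = 5.9\ \text{GOPS} \tag{21}$$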
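
A minimal sketch of the amplitude-product reuse, assuming the product matrix is formed as the first user's amplitude times the conjugate of the second user's amplitude. The test values are arbitrary and the function names are illustrative.

```python
def amplitude_products(a, b):
    """x[q][q'] = a[q] * conj(b[q']): the reusable amplitude-amplitude products."""
    return [[aq * bq.conjugate() for bq in b] for aq in a]

def rho_from_products(x, c):
    """Long real dot product of length 2*L^2: sum of x_r*c_r + x_i*c_i."""
    n = len(x)
    return sum(x[q][p].real * c[q][p].real + x[q][p].imag * c[q][p].imag
               for q in range(n) for p in range(n))

def rho_direct(a, c, b):
    """Reference value: sum over q, q' of Re{conj(a[q]) * b[q'] * c[q][q']}."""
    n = len(a)
    return sum((a[q].conjugate() * b[p] * c[q][p]).real
               for q in range(n) for p in range(n))

a = [1 + 2j, -0.5 + 1j, 0.25 - 0.75j]
b = [0.3 - 1j, 2 + 0.5j, -1 - 1j]
c = [[1 + 1j, 2 - 1j, 0.5j], [-1 + 0j, 0.5 + 0.5j, 1 - 2j], [2j, -1 - 1j, 0.75 + 0j]]

x = amplitude_products(a, b)   # computed once per pair of physical users
print(rho_from_products(x, c), rho_direct(a, c, b))
```

Because `x` depends only on the amplitudes, it can be computed once and applied to every correlation matrix belonging to the same pair of physical users.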



Dual Diversity Antennas 

When dual diversity antennas are employed, the calculation of the R-matrix 
elements becomes 



(22) 

To calculate $X$ for dual diversity antennas, then, requires 

$$N_{ops} = (KL)^2(14) = (K_vL)^2(14/4) = \tfrac{3}{2}(K_vL)^2(7/3) = \tfrac{3}{2}(K_vL)^2(2.33) \tag{23}$$

operations. The remainder of the calculation is again a long dot product of 
length $2L^2$ requiring $4L^2$ operations, which gives 

$$\text{GOPS} = \frac{\tfrac{3}{2}(K_vL)^2(6.33)}{1.33\ \text{ms}} = \frac{1.5(256\cdot 4)^2(6.33)}{1.33\ \text{ms}} = 7.5\ \text{GOPS} \tag{24}$$



Reuse of C data 



So far we have not addressed the problem associated with a lack of data reuse, 
which renders our calculations I/O limited. The C data can be reused by 
introducing extra latency into the calculations. For a given user, a single 
multipath component changes on average once every 100 ms, or once every 150 
slots. Suppose we collect and save in cache 4 amplitude estimate vectors 
$\hat a_k[q]$, where $q$ is the 2 ms update index. The total latency is then 
8 ms = 12 time slots. During this time the probability that a multipath lag 
changes is (8 ms)/(100 ms) = 0.08. The probability that the matrix $C_{lk}$ 
changes is then $1 - (1 - 0.08)^2 = 0.15$. Hence for most matrices $C_{lk}$ we 
will be able to calculate 

$$\hat a_l^H[q]\,C_{lk}\,\hat a_k[q] \tag{25}$$

over the 12 time slots covered by the indices $q$ with only one read of $C_{lk}$ 
from memory. The penalty for this reuse is the 8 ms of latency incurred. 
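
The probability estimate can be written as a one-line function. The sketch assumes, as the text does implicitly, that the two users' lag changes are independent; the names are illustrative.

```python
def prob_matrix_changes(latency_ms=8.0, lag_change_period_ms=100.0, users_involved=2):
    """Probability that at least one of the users involved changes a lag during the latency."""
    p_user = latency_ms / lag_change_period_ms
    return 1.0 - (1.0 - p_user) ** users_involved

p_change = prob_matrix_changes()
print(round(p_change, 4))
```

Roughly 85 percent of the matrices are therefore expected to stay valid for the whole 8 ms window under this assumption.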



Subject: Theoretically optimum load balancing for the R matrix calculations 
File Ref: mjv-9.doc 

This memo describes the calculation of optimum R matrix partitioning points in 
normalized virtual user space. These partitioning points provide an equal, and 
hence balanced, computation load per processor. The computational model of the 
R matrix calculations does not include any data access overhead or caching 
effects. It is shown that a closed form recursive solution exists that can be 
solved for an arbitrary number of processors. 

Although three R matrices are output from the R matrix calculation function, 
only half of the elements are explicitly calculated. This is due to the symmetry 
condition that exists between R matrices: 

$$R_{l,k}(m) = \xi\,R_{k,l}(-m).$$

In essence, only two matrices need to be calculated. The first one is a combination of 
R(1) and R(-1). The second is the R(0) matrix. In this case, the essential R(0) matrix 
elements have a triangular structure. The numbers of computations performed to 
generate the raw data for the R(1)/R(-1) and R(0) matrices are combined and optimized as 
a single number. This is due to the reuse of the X matrix outer product values across the 
two R matrices. Since the bulk of the computations involve combining the X matrix and 
correlation values, they dominate the processor utilization. These computations are used 
as a cost metric in determining the optimum loading of each processor. 



The optimization problem is formulated as an equal area problem, where the solution 
makes every partition area equal. Since the major dimensions of the R matrices 
are in terms of the number of active virtual users, the solution space for this problem is in 
terms of the number of virtual users per processor. By normalizing the solution space by 
the number of virtual users, the solution is applicable for an arbitrary number of virtual 
users. 

Figure 1: Normalized R matrix computation model. 

Figure 1 shows the model of the normalized optimization problem. The computations for 
the R(1)/R(-1) matrix are represented by the square HJKM, while the computations for 
the R(0) matrix are represented by the triangle ABC. From geometry, the area of a 
rectangle of length $b$ and height $h$ is 

$$A_r = bh.$$

For a triangle with a base width $b$ and height $h$, the area is calculated by 

$$A_t = \tfrac{1}{2}bh.$$

When combined with a common height $a_i$ the formula for the area becomes 

$$A_i = A_{r_i} + A_{t_i} = a_Na_i + \tfrac{1}{2}a_i^2 = a_i + \tfrac{1}{2}a_i^2.$$

The formula for $A_i$ gives the area for the total region below the partition line. For 
example, the formula for $A_2$ gives the area within the rectangle HQRM plus the region 
within triangle AFG. For the cost function, the difference in successive areas is used. 
That is 

$$B_i = A_i - A_{i-1} = \tfrac{1}{2}a_i^2 + a_i - a_{i-1} - \tfrac{1}{2}a_{i-1}^2.$$

For an optimum solution, the $B_i$ must be equal for $i = 1, 2, \ldots, N$, where $N$ is the 
number of processors performing the calculations. Because the total normalized load is 
equal to $A_N$, the load per processor is equal to $A_N/N$: 

$$B_i = \frac{A_N}{N} = \frac{3}{2N}, \quad \text{for } i = 1, 2, \ldots, N.$$

By combining the two equations for $B_i$, the solution for $a_i$ is found by finding the 
roots of the equation: 

$$\tfrac{1}{2}a_i^2 + a_i - \tfrac{1}{2}a_{i-1}^2 - a_{i-1} - \frac{3}{2N} = 0.$$

The solution for $a_i$ is: 

$$a_i = -1 \pm \sqrt{1 + a_{i-1}^2 + 2a_{i-1} + \frac{3}{N}}, \quad \text{for } i = 1, 2, \ldots, N.$$

Since the solution space must fall in the range [0, 1], negative roots are not valid 
solutions to the problem. On the surface, it appears that the $a_i$ must be solved 
recursively, starting with the case $i = 1$. However, by expanding the recursion and using 
the fact that $a_0$ equals zero, a solution that does not require the previous $a_j$, 
$j = 0, 1, \ldots, i-1$, exists. The solution is: 

$$a_i = -1 + \sqrt{1 + \frac{3i}{N}}.$$
Table 1 shows the normalized partition values for two, three, and four processors. To 
calculate the actual partitioning values, the number of active virtual users is multiplied by 
the corresponding table entries. Since a fraction of a user cannot be allocated, a ceiling 
operation is performed that biases the number of virtual users per processor towards the 
processors whose loading function is less sensitive to perturbations in the number of 
users. 
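
The closed-form partition points and the ceiling step can be sketched as follows. The closed form used is a_i = -1 + sqrt(1 + 3i/N), which reproduces the decimal values listed in Table 1; the helper names and the 256-user example are illustrative assumptions.

```python
import math

def partition_points(n_processors):
    """Normalized partition points a_1..a_N from a_i = -1 + sqrt(1 + 3*i/N)."""
    return [-1.0 + math.sqrt(1.0 + 3.0 * i / n_processors)
            for i in range(1, n_processors + 1)]

def area_below(a):
    """Normalized load below partition point a: rectangle a plus triangle a^2/2."""
    return a + 0.5 * a * a

def user_boundaries(n_virtual_users, n_processors):
    """Upper user index of each partition, using a ceiling as described in the text."""
    return [math.ceil(a * n_virtual_users) for a in partition_points(n_processors)]

for n in (2, 3, 4):
    print(n, [round(a, 4) for a in partition_points(n)])
print(user_boundaries(256, 4))
```

Each processor's share of the load, the difference of `area_below` between successive points, comes out as 3/(2N) in this model.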



Table 1: Normalized partition locations for two, three, and four processors. 

Location   Two processors               Three processors           Four processors 
$a_1$      $-1+\sqrt{5/2}$ (0.5811)     $-1+\sqrt{2}$ (0.4142)     $-1+\sqrt{7/4}$ (0.3229) 
$a_2$                                   $-1+\sqrt{3}$ (0.7321)     $-1+\sqrt{5/2}$ (0.5811) 
$a_3$                                                              $-1+\sqrt{13/4}$ (0.8028) 

Jonathan Schonfeld 

Date: 23-FEB-2001 

From: 

Nmf 

Subject: Degraded mode of operation for the MUD algorithm 

File Ref: mjv-018-degraded_mode_desc.doc 



Reference [1] showed that the load balancing for the R matrix calculations resulted in a non-uniform 
partitioning of the rows of the final R matrices over a number of processors. In summary, the 
partition sizes increase as the partition starting user index increases. 

When the system is running at full capacity (i.e. the maximum number of users is processed while 
still within the bounds of real-time operation) and a computational node has a failure, the impact can 
be significant. 

This impact can be minimized by allocating the first user partition to the disabled node. Also, the 
values that would have been calculated by that node are set to zero. This reduces the effects of the 
failed node. Also, by changing which user data is set to zero (i.e. which users are assigned to the 
failed node) the overall errors due to the lack of non-zero output data for that node are averaged 
over all of the users, providing a "soft" degradation. 



[1] M. Vinskus. "mjv-009: Theoretically optimum load balancing for the R matrix calculations." 
31-AUG-2000. 
[2] M. Vinskus. "mjv-010: Preliminary degraded MUD operation results." 19-OCT-2000. 
[3] J. Oates. "jho-001: MUD Algorithms." 25-APR-2000. 

To: Wireless Communications Group 
From: J. H. Oates 
Subject: Methods for Calculating the C-matrix Elements 
Date: November 13, 2000 

1. Direct Method 

The direct method for calculating the C-matrix elements is 

$$C_{lkqq'}[m'] = \frac{1}{2N_l}\sum_n s_k[nN_c + m'T + \hat\tau_{lq} - \hat\tau_{kq'}]\, c_l^*[n] \tag{1}$$

Symmetry: 

$$C_{klq'q}[-m'] = \frac{N_l}{N_k}\,C^*_{lkqq'}[m'] \tag{2}$$


Due to symmetry there are $1.5(K_vL)^2$ elements to calculate. Assuming all users 
are at SF 256, each calculation requires 256 cmacs, or 2048 operations. The 
probability that a multipath changes in a 10 ms time period is approximately 
10/200 = 0.05 if all users are at 120 kmph. Assuming a mix of user velocities, 
let's say the probability is 0.025. Since the C-matrix elements represent the 
interaction between two users, the probability that C-matrix elements change in a 
10 ms time period is approximately 0.10 if all users are at 120 kmph, or 0.05 for 
a mix of user velocities. The GOPS are tabulated in Table 1 below. 






The C-matrix elements also need to be updated when the spreading factor 
changes. The spreading factor can change due to 

• AMR codec rate changes 

• Multiplexing of DCCH 

• Multiplexing data services 

For lack of a better number, assume that 5% of the users, and hence 10% of the 
elements, change rate every 10 ms. 
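
The rows of the direct-method table can be regenerated with the sketch below. It assumes the percentage-change column is the sum of the multipath-change probability (10% or 5%) and the 10% spreading-factor change, which matches the 20 and 15 entries in the table; the function name is illustrative.

```python
def direct_method_row(k_virtual, pct_change, n_paths=4,
                      ops_per_element=2048, period_s=0.010):
    """Return (elements, giga-ops for a full update, GOPS at the given change rate)."""
    elements = 1.5 * (k_virtual * n_paths) ** 2
    gops_full_update = elements * ops_per_element / 1e9
    gops = gops_full_update * (pct_change / 100.0) / period_s
    return int(elements), gops_full_update, gops

for k_v, pct in [(200, 20), (200, 15), (128, 20), (128, 15)]:
    elements, full, rate = direct_method_row(k_v, pct)
    print(k_v, pct, elements, round(full, 3), round(rate, 1))
```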



Table 1. GOPS to update C-matrix elements using the direct method. 

K_v    High velocity users    1.5(K_v L)^2    Gops     Percentage change    GOPS 
200    100%                   960,000         1.966    20                   39.3 
200    50%                    960,000         1.966    15                   29.5 
128    100%                   393,216         0.805    20                   16.1 
128    50%                    393,216         0.805    15                   12.1 

2. FFT Method 

The FFT can be used to calculate the correlations for a range of offsets $\tau$ using 

(3) 

The length of the waveform $s_k[t]$ is $L_g + 255N_c = 1068$ for $L_g = 48$ and 
$N_c = 4$. This is represented as $N_c$ waveforms of length $L_g/N_c + 255 = 267$. 

One advantage of this approach is that elements can be stored for a range of 
offsets $\tau$ so that calculations do not need to be performed when lags change. For 
delay spreads of about 4 µs, 32 samples need to be stored for each $m'$. 

3. Using Code Correlations 

The C-matrix elements can be represented in terms of the underlying code 
correlations using 

$$C_{lkqq'}[m'] = \frac{1}{2N_l}\sum_n s_k[nN_c + m'T + \hat\tau_{lq} - \hat\tau_{kq'}]\, c_l^*[n]$$

$$= \frac{1}{2N_l}\sum_n\sum_m g[mN_c + \tau]\, c_k[n-m]\, c_l^*[n]$$

$$= \sum_m g[mN_c + \tau]\,\frac{1}{2N_l}\sum_n c_l^*[n]\, c_k[n-m] \tag{4}$$

$$= \sum_m g[mN_c + \tau]\,\Gamma_{lk}[m]$$

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l}\sum_n c_l^*[n]\, c_k[n-m], \qquad \tau \equiv m'T + \hat\tau_{lq} - \hat\tau_{kq'}$$

If the length of $g[t]$ is $L_g = 48$ and $N_c = 4$, then the summation over $m$ 
requires 48/4 = 12 macs for the real part and 12 macs for the imaginary part. The 
total is then 48 ops per element. (Compare with 2048 operations for the direct 
method.) Hence for the case where there are 200 virtual users and 20% of the 
C-matrix needs updating every 10 ms the required complexity is (960000 el)(48 
ops/el)(0.20)/(0.010 sec) = 921.6 MOPS. This is the required complexity to 
compute the C-matrix from the $\Gamma$-matrix. The cost of computing the 
$\Gamma$-matrix must also be considered. There is reason to hope that the 
$\Gamma$-matrix can be efficiently computed since the fundamental operation is a 
convolution of codes with elements constrained to be +/-1 +/-j. 
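
The 48 operations per element and the 921.6 MOPS figure follow from simple arithmetic, sketched below under the text's assumptions (12 pulse taps per part, two operations per mac, 20% of 960,000 elements refreshed every 10 ms). The function name is illustrative.

```python
def code_correlation_update_mops(k_virtual=200, n_paths=4, l_g=48, n_c=4,
                                 fraction_updated=0.20, period_s=0.010):
    """Return (operations per C-matrix element, MOPS for the stated update rate)."""
    elements = 1.5 * (k_virtual * n_paths) ** 2
    macs_per_part = l_g // n_c                  # pulse taps per real or imaginary part
    ops_per_element = 2 * macs_per_part * 2     # two parts, two operations per mac
    mops = elements * ops_per_element * fraction_updated / period_s / 1e6
    return ops_per_element, mops

ops_per_element, mops = code_correlation_update_mops()
print(ops_per_element, round(mops, 1), round(2048 / ops_per_element, 1))
```

The last printed value is the ratio to the 2048 operations of the direct method, a reduction of roughly a factor of 43 before the cost of computing the code correlations themselves is counted.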

The $\Gamma$-matrix elements can be calculated using 

• the FFT 

• Modulo-2 arithmetic 

• Hardware XOR 

• Short-code generator(?) 

4. Using Fundamental Correlations 



The waveform $s_k[t]$ can be decomposed into fundamental waveforms 
corresponding to 4-chip segments of the corresponding complex user codes. 
There are $2^8 = 256$ such waveforms. Each of these can be correlated with 
another 256 possible 4-chip code segments. For each correlation there are about 
64 offsets that produce a non-zero correlation. Hence all correlation calculations 
can be represented in terms of 256(256)(64) = 4M fundamental complex 
correlations. The C-matrix elements are then 

$$C_{lkqq'}[m'] = \frac{1}{2N_l}\sum_n s_k[nN_c + m'T + \hat\tau_{lq} - \hat\tau_{kq'}]\, c_l^*[n]$$

$$C_{lk}[\tau] = \frac{1}{2N_l}\sum_n s_k[nN_c + \tau]\, c_l^*[n]$$

$$C_{lk}[\tau] = \sum_{i=0}^{63}\sum_{j=0}^{63}\frac{1}{2N_l}\sum_{n=0}^{3} s_k^{(j)}[nN_c + \tau]\, c_l^{(i)*}[n] \tag{5}$$

where the superscripts $(i)$ and $(j)$ denote the 4-chip code segments and the 
corresponding fundamental waveforms. 



Using the above, each C-matrix element requires 64(64) = 4096 complex adds, 
or 8192 operations to calculate. (Compare with 2048 operations for the direct 
method.) 

Alternately, the calculations can be represented in terms of 4-chip real code 
segments and the corresponding waveforms. Hence all correlation calculations 
can be represented in terms of 16(16)(64) = 16K fundamental real correlations. 

To: Wireless Communications Group 
From: J. H. Oates 
Subject: Calculation of C-matrix Elements 
Date: August 10, 2000 

1. Introduction 

The C-matrix elements are used to calculate the R-matrices, which are used by the MDF 
interference cancellation routine. Each C-matrix element can be calculated as a dot 
product between the $k$th user's waveform and the $l$th user's code stream, each offset by 
some multipath delay. For this method of calculation, each time a user's multipath profile 
changes all C-matrix elements associated with the changed profile must be recalculated. 
It is estimated that a user profile changes every 100 ms. This number, however, is based 
on very little data, and there is considerable risk that profiles may change more rapidly 
and compromise real-time operation. In addition, there is a large amount of overhead that 
must be performed before each dot product. In a recent benchmark the overhead 
consumed nearly all of the time allocated for the entire C-matrix update. Finally, if the 
C-matrix is calculated as described above then an entire processor must be allocated for 
this calculation. 



In view of the above observations a better approach is to pre-calculate the code 
correlations up-front when a user is added to the system. This calculation is performed 
over all possible code offsets and the results are stored in a large array, 
approximately 21 Mbytes in size. We will henceforth refer to this large matrix as the 
$\Gamma$ matrix. The C-matrix elements are updated when a profile changes by extracting the 
appropriate elements from the $\Gamma$ matrix and performing minor calculations. Since the 
$\Gamma$ matrix elements are calculated for all code offsets the FFT can be effectively used to 
speed up the calculations. Since all code offsets are pre-calculated, there is no risk 
associated with rapidly changing multipath profiles. Under normal operating conditions, 
when the number of users accessing the system is constant, the resources which must be 
allocated to extracting the C-matrix elements are minimal, and so extra resources may be 
allocated to the R-matrix calculation. 

Section 2 below outlines the calculation of the $\Gamma$ matrix elements. It is shown that the 
$\Gamma$ matrix elements are given in terms of a convolution. Section 3 shows how to calculate the 
$\Gamma$ matrix elements using the FFT. Section 4 describes how the $\Gamma$-matrix elements might be 
summary with conclusions is given in section 6. 

2. C-matrix Elements Expressed in Terms of Code Correlations 

The R-matrix elements are given in terms of the C-matrix elements as [1] 

$$\rho_{lk}[m']\,A_lA_k = \sum_{q=1}^{L}\sum_{q'=1}^{L}\mathrm{Re}\{\hat a_{lq}^*\,\hat a_{kq'}\,C_{lkqq'}[m']\}$$

$$C_{lkqq'}[m'] \equiv \frac{1}{2N_l}\sum_n s_k[nN_c + m'T + \hat\tau_{lq} - \hat\tau_{kq'}]\, c_l^*[n] \tag{1}$$

where $C_{lkqq'}[m']$ is a five-dimensional matrix of code correlations. Both $l$ and $k$ 
range from 1 to $K_v$, where $K_v$ is the number of virtual users. The indices $q$ and $q'$ 
range from 1 to $L$, the number of multipath components, which is assumed to be equal 
to 4. The symbol period offset $m'$ ranges from -1 to 1. The total number of matrix 
elements to be calculated is then $n_c = 3(K_vL)^2 = 3(800)^2 = 1.92\text{M}$ complex 
elements, or 3.84 MB if the real and imaginary parts of each element are each stored as 
a byte. This number is cut in half, however, due to the symmetries [2] 

$$C_{klq'q}[-m'] = \frac{N_l}{N_k}\,C^*_{lkqq'}[m'] \tag{2}$$

The memory requirement is then 1.92 MB. 

Referring to Equation (1) it is evident that each element of $C_{lkqq'}[m']$ is a complex dot 
product between a code vector $c_l$ and a waveform vector $s_{kqq'}$. The length of the code 
vector is 256. The waveform $s_k[t]$ is referred to as the signature waveform for the $k$th 
virtual user. This waveform is generated by passing the spread code sequence $c_k[n]$ 
through a pulse-shaping filter $g[t]$ 

$$s_k[t] = \sum_{p=0}^{N-1} g[t - pN_c]\, c_k[p] \tag{3}$$

where $N = 256$ and $g[t]$ is the raised-cosine pulse shape. Since $g[t]$ is a raised-cosine 
pulse as opposed to a root-raised-cosine pulse, the signature waveform $s_k[t]$ includes the 
effects of filtering by the matched chip filter. Note that for spreading factors less than 256 
some of the chips $c_k[p]$ are zero. The length of the waveform vector is $L_g + 255N_c$, where 
$L_g$ is the length of the raised-cosine pulse vector $g[t]$ and $N_c$ is the number of samples per 
chip. The values for these parameters as currently implemented are $L_g = 48$ and $N_c = 4$. 
The length of the waveform vector is then 1068, but for the dot product it is accessed at a 
stride of $N_c = 4$, which gives effectively a length of 267. 

The raised-cosine pulse vector $g[t]$ is defined to be non-zero from $t = -L_g/2 + 1 : L_g/2$, with 
$g[0] = 1$. With this definition the waveform $s_k[t]$ is non-zero from 
$t = -L_g/2 + 1 : L_g/2 + 255N_c$. 

By combining Equations (1) and (3) the calculation of the C-matrix elements can be 
expressed directly in terms of the user code correlations. These correlations can be 
calculated up front and stored in SDRAM. The C-matrix elements expressed in terms of 
the code correlations $\Gamma_{lk}[m]$ are 

$$C_{lkqq'}[m'] = \frac{1}{2N_l}\sum_n s_k[nN_c + m'T + \hat\tau_{lq} - \hat\tau_{kq'}]\, c_l^*[n]$$

$$= \frac{1}{2N_l}\sum_n\sum_p g[(n-p)N_c + m'T + \hat\tau_{lq} - \hat\tau_{kq'}]\, c_k[p]\, c_l^*[n]$$

$$= \frac{1}{2N_l}\sum_n\sum_m g[mN_c + \tau]\, c_k[n-m]\, c_l^*[n]$$

$$= \sum_m g[mN_c + \tau]\,\underbrace{\frac{1}{2N_l}\sum_n c_l^*[n]\, c_k[n-m]}_{\Gamma_{lk}[m]} \tag{4}$$

Since the pulse shape vector $g[n]$ is of length $L_g$ there are at most $2L_g/N_c = 24$ real 
macs to be performed to calculate each element $C_{lkqq'}[m']$. (The factor of 2 is because the 
code correlations $\Gamma_{lk}[m]$ are complex.) Given $\tau$ it is important to be able to efficiently 
calculate the range of values $m$ for which $g[mN_c + \tau]$ is non-zero. The minimum value of $m$ 
is given by $m_{min1}N_c + \tau = -L_g/2 + 1$. Now $\tau$ is given by 
$\tau = m'NN_c + \hat\tau_{lq} - \hat\tau_{kq'}$. If each $\hat\tau$ value is decomposed as 
$\hat\tau_{lq} = n_{lq}N_c + p_{lq}$, then $m_{min1} = \mathrm{ceil}[(-\tau - L_g/2 + 1)/N_c] = 
-m'N - n_{lq} + n_{kq'} - L_g/(2N_c) + \mathrm{ceil}[(p_{kq'} - p_{lq} + 1)/N_c]$. Now 
$\mathrm{ceil}[(p_{kq'} - p_{lq} + 1)/N_c]$ will be either 0 or 1. It is convenient to set this 
to 0. In order that we do not access values outside the allocation for $g[n]$ we must set 
$g[n] = 0.0$ for $n = -L_g/2 : -L_g/2 - (N_c - 1)$. Note that of the $N_c^2$ possible values for 
$\mathrm{ceil}[(p_{kq'} - p_{lq} + 1)/N_c]$, all but one are 0. Hence we have 

$$m_{min1} = -m'N - n_{lq} + n_{kq'} - L_g/(2N_c) \tag{5}$$

Note that $L_g$ must be divisible by $2N_c$, and that $L_g/(2N_c)$ should be a system constant. 

The maximum value of $m$ is given by $m_{max1}N_c + \tau = L_g/2$. This gives 
$m_{max1} = \mathrm{floor}[(-\tau + L_g/2)/N_c] = -m'N - n_{lq} + n_{kq'} + L_g/(2N_c) + 
\mathrm{floor}[(p_{kq'} - p_{lq})/N_c]$. Now $\mathrm{floor}[(p_{kq'} - p_{lq})/N_c]$ will be 
either -1 or 0. It is convenient to set this to 0. In order that we do not access values 
outside the allocation for $g[n]$ we must set $g[n] = 0.0$ for $n = L_g/2 + 1 : L_g/2 + N_c$. 
Note that of the $N_c^2$ possible values for $\mathrm{floor}[(p_{kq'} - p_{lq})/N_c]$, about half 
are 0. Hence we have 

$$m_{max1} = -m'N - n_{lq} + n_{kq'} + L_g/(2N_c) \tag{6}$$

These values are quickly calculable. 

The $\Gamma$ matrix is calculated in the next section for all values $m$ by exploiting the FFT. 
Notice that the calculation of the C-matrix elements requires only a small subset of the 
$\Gamma$ matrix elements. 
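
The relationship between the direct dot product and the code-correlation form, together with the conservative index range for m, can be illustrated with a toy example. Everything below is a sketch under stated assumptions: a hypothetical triangular pulse stands in for the raised-cosine pulse, the codes are 8 chips long rather than 256, and all function names are illustrative.

```python
N_C = 4   # samples per chip (toy value)
L_G = 8   # pulse length; must be divisible by 2 * N_C

def pulse(t):
    """Hypothetical triangular pulse, non-zero only for |t| < L_G / 2."""
    return max(0.0, 1.0 - abs(t) / (L_G / 2))

def gamma(c_l, c_k, m):
    """Code correlation (1/2N) * sum_n conj(c_l[n]) * c_k[n - m]."""
    n_chips = len(c_l)
    total = sum(c_l[n].conjugate() * c_k[n - m]
                for n in range(n_chips) if 0 <= n - m < len(c_k))
    return total / (2 * n_chips)

def c_direct(c_l, c_k, tau):
    """(1/2N) * sum_n s_k[n*N_C + tau] * conj(c_l[n]), with s_k built from the pulse."""
    def s_k(t):
        return sum(pulse(t - p * N_C) * c_k[p] for p in range(len(c_k)))
    n_chips = len(c_l)
    return sum(s_k(n * N_C + tau) * c_l[n].conjugate() for n in range(n_chips)) / (2 * n_chips)

def m_range(m_prime, tau_l, tau_k, n_chips):
    """Conservative (m_min, m_max), with the ceil and floor terms set to 0."""
    base = -m_prime * n_chips - tau_l // N_C + tau_k // N_C
    half = L_G // (2 * N_C)
    return base - half, base + half

def c_from_gamma(c_l, c_k, m_prime, tau_l, tau_k):
    """Sum over the restricted m range of pulse[m*N_C + tau] * gamma[m]."""
    n_chips = len(c_l)
    tau = m_prime * n_chips * N_C + tau_l - tau_k
    m_min, m_max = m_range(m_prime, tau_l, tau_k, n_chips)
    return sum(pulse(m * N_C + tau) * gamma(c_l, c_k, m) for m in range(m_min, m_max + 1))

C_L = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
C_K = [1 - 1j, -1 - 1j, 1 + 1j, 1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, -1 + 1j]
print(c_from_gamma(C_L, C_K, 0, 5, 10), c_direct(C_L, C_K, -5))
```

In this toy setting only three values of m are visited per element, which is the point of restricting the sum to the range where the pulse is non-zero.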

3. Using the FFT to Calculate the $\Gamma$-matrix Elements 

In the previous section it was shown that the $\Gamma$-matrix elements can be represented as a 
convolution. This fact is here exploited to calculate the $\Gamma$-matrix elements using the FFT 
convolution theorem. From Equation (4) the $\Gamma$-matrix elements are 

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l}\sum_{n=0}^{N-1} c_l^*[n]\, c_k[n-m] \tag{7}$$

where $N = 256$. Three streams are related by this equation. In order to apply the 
convolution theorem all three streams must be defined over the same time interval. The 
code streams $c_k[n]$ and $c_l[n]$ are non-zero from $n = 0:255$. These intervals are based on 
the maximum spreading factor. For higher data-rate users the intervals over which the 
streams are non-zero are reduced further. We are concerned here, however, with the 
intervals derived from the highest spreading factor since these will be the largest intervals 
and we wish to define a common interval for all streams. The common interval allows the 
FFTs to be reused for all user interactions. 

[Figure 1 diagram: streams $c_k[n]$ ($N_k$ = 256), $c_k[n - m_{min}]$, $c_k[n - m_{max}]$ and 
$c_l[n]$ ($N_l$ = 128) on an axis marked n = -256, n = 0 and n = 255.] 

Figure 1. Interval for FFT calculation of the $\Gamma$ matrix elements. Shown 
for the case where $N_k$ = 256 and $N_l$ = 128. 

The range of values $m$ for which $\Gamma_{lk}[m]$ is non-zero can be derived from the above 
intervals. The maximum value of $m$ is limited by $n - m \ge 0$, which gives 

$$255 - m_{max} = 0 \;\Rightarrow\; m_{max} = 255 \tag{8}$$

and the minimum value is limited by $n - m \le 255$, which gives 

$$0 - m_{min} = 255 \;\Rightarrow\; m_{min} = -255 \tag{9}$$

To achieve a common interval for all three streams we select the interval 
$m = -M/2 : M/2 - 1$, $M = 512$. Where necessary the streams are zero-padded to fill up the 
interval. 

Now, the DFT and IDFT of the streams are 

$$C_k[r] = \sum_{n=-M/2}^{M/2-1} c_k[n]\, e^{-j2\pi nr/M}, \qquad c_k[n] = \frac{1}{M}\sum_{r=-M/2}^{M/2-1} C_k[r]\, e^{j2\pi nr/M} \tag{10}$$

which gives 

$$\Gamma_{lk}[m] = \frac{1}{2N_l}\sum_{n=-M/2}^{M/2-1} c_l^*[n]\, c_k[n-m] = \frac{1}{2N_lM}\sum_{r=-M/2}^{M/2-1} C_k[r]\, C_l^*[r]\, e^{-j2\pi mr/M} \tag{11}$$

Hence $\Gamma_{lk}[m]$ can be calculated using the FFTs. Notice that the FFT gives values for all $m$. 
From the analysis above we know that many of these values will be zero for high data rate 
users. To conserve memory we wish to store only the non-zero values. The values of $m$ 
for which $\Gamma_{lk}[m]$ is non-zero can be determined analytically. This subject is treated in the 
next section where the storage and retrieval of the $\Gamma$-matrix elements is considered. 
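
The transform-domain calculation can be illustrated on a small example. The sketch below uses a plain O(M^2) DFT written out in the standard library instead of an FFT, 8-chip toy codes zero-padded to M = 16, and a forward transform defined with a negative exponent; with a different sign convention the exponent in the final sum flips accordingly. All names are illustrative.

```python
import cmath

def dft(x):
    """Forward DFT: X[r] = sum_n x[n] * exp(-j*2*pi*n*r/M). Naive O(M^2) version."""
    m_len = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * r / m_len) for n in range(m_len))
            for r in range(m_len)]

def gamma_direct(c_l, c_k, m):
    """(1/2N) * sum_n conj(c_l[n]) * c_k[n - m], evaluated in the time domain."""
    n_chips = len(c_l)
    total = sum(c_l[n].conjugate() * c_k[n - m]
                for n in range(n_chips) if 0 <= n - m < len(c_k))
    return total / (2 * n_chips)

def gamma_via_dft(c_l, c_k, m_len):
    """All offsets at once; the entry for offset m is stored at index m mod m_len."""
    n_chips = len(c_l)
    spec_l = dft(list(c_l) + [0] * (m_len - len(c_l)))   # zero-pad to the common interval
    spec_k = dft(list(c_k) + [0] * (m_len - len(c_k)))
    product = [spec_k[r] * spec_l[r].conjugate() for r in range(m_len)]
    return [value / (2 * n_chips * m_len) for value in dft(product)]

C_L = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j]
C_K = [1 - 1j, -1 - 1j, 1 + 1j, 1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, -1 + 1j]
M_LEN = 16   # at least 2N - 1, so the circular correlation does not wrap

gammas = gamma_via_dft(C_L, C_K, M_LEN)
print(gammas[0], gamma_direct(C_L, C_K, 0))
```

Negative offsets are read from the top of the output array, which mirrors the choice of a common interval from -M/2 to M/2 - 1 in the text.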



4. Storage and Retrieval of $\Gamma$-matrix Elements 

In order to efficiently store the $\Gamma$-matrix elements we must determine which values are 
non-zero. For high data rate users certain elements $c_l[n]$ are zero, even within the interval 
$n = 0 : N - 1$, $N = 256$. These zero values reduce the interval over which $\Gamma_{lk}[m]$ is 
non-zero. In order to determine the interval for non-zero values consider 

$$\Gamma_{lk}[m] \equiv \frac{1}{2N_l}\sum_{n=0}^{N-1} c_l^*[n]\, c_k[n-m] \tag{12}$$

Define index $j_l$ for the $l$th virtual user such that $c_l[n]$ is non-zero only over the interval 
$n = j_lN_l : j_lN_l + N_l - 1$. Correspondingly, the vector $c_k[n]$ is non-zero only over the 
interval $n = j_kN_k : j_kN_k + N_k - 1$. Given these definitions $\Gamma_{lk}[m]$ can be rewritten as 

$$\Gamma_{lk}[m] = \frac{1}{2N_l}\sum_{n=0}^{N_l-1} c_l^*[n + j_lN_l]\, c_k[n + j_lN_l - m] \tag{13}$$

The minimum value of $m$ for which $\Gamma_{lk}[m]$ is non-zero is 

$$m_{min2} = -N_k + 1 - j_kN_k + j_lN_l \tag{14}$$

and the maximum value of $m$ for which $\Gamma_{lk}[m]$ is non-zero is 

$$m_{max2} = N_l - 1 - j_kN_k + j_lN_l \tag{15}$$

The total number of non-zero elements is then 

$$m_{total} = m_{max2} - m_{min2} + 1 = N_l + N_k - 1 \tag{16}$$

Table 1 below gives the number of bytes per $(l, k)$ virtual-user pair based on 2 bytes per 
element, one byte for the real part and one byte for the imaginary part. 

              N_k = 256    128     64     32     16      8      4 
N_l = 256          1022    766    638    574    542    526    518 
      128           766    510    382    318    286    270    262 
       64           638    382    254    190    158    142    134 
       32           574    318    190    126     94     78     70 
       16           542    286    158     94     62     46     38 
        8           526    270    142     78     46     30     22 
        4           518    262    134     70     38     22     14 


Now we are in a position to determine the memory requirements for the Γ matrix for a
given number of users at each spreading factor. Let there be K_q virtual users at spreading
factor N_q = 2^(8-q), q = 0:6, where K_q is the qth element of the vector K. Note that some
elements of K may be zero. Let Table 1 above be stored in matrix M with elements M_qq'.
For example, M_00 = 1022, and M_01 = 766. The total memory required by the Γ matrix in
bytes is then

$$M_{bytes} = \frac{1}{2} \sum_{q=0}^{6} \left\{ K_q M_{qq} + \sum_{q'=0}^{6} K_q K_{q'} M_{qq'} \right\} \tag{17}$$
For example, for 200 virtual users at spreading factor N_0 = 256 we have K_q = 200 δ_q0,
which gives M_bytes = ½ K_0 (K_0 + 1) M_00 = 100(201)(1022) = 20.5 MB.

For 10 384 Kbps users we have K_q = K_0 δ_q0 + K_6 δ_q6 with K_0 = 10 and K_6 = 640. This gives
M_bytes = ½ K_0 (K_0 + 1) M_00 + K_0 K_6 M_06 + ½ K_6 (K_6 + 1) M_66 = 5(11)(1022) + 10(640)(518) +
320(641)(14) = 6.2 MB.
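The two worked examples can be reproduced with a short C sketch. The function names and the `NUM_SF` constant are hypothetical; the per-pair byte count is taken from the pattern of Table 1, 2(N_q + N_q' - 1), and only pairs with k >= l are counted, as in the examples above.

```c
#include <assert.h>

#define NUM_SF 7   /* spreading factors 256, 128, ..., 4 (q = 0..6) */

/* Bytes per (l,k) pair as in Table 1: 2 * (N_q + N_q' - 1), N_q = 2^(8-q). */
static long pair_bytes(int q, int qp)
{
    long Nq = 1L << (8 - q);
    long Nqp = 1L << (8 - qp);
    return 2 * (Nq + Nqp - 1);
}

/* Total Gamma-matrix size in bytes for K[q] virtual users at each
   spreading factor, counting each unordered pair (and each diagonal) once. */
long gamma_matrix_bytes(const long K[NUM_SF])
{
    long total = 0;

    for (int q = 0; q < NUM_SF; q++) {
        assert(K[q] >= 0);
        total += K[q] * (K[q] + 1) / 2 * pair_bytes(q, q);
        for (int qp = q + 1; qp < NUM_SF; qp++)
            total += K[q] * K[qp] * pair_bytes(q, qp);
    }
    return total;
}
```

With 200 users at q = 0 this evaluates to roughly 20.5 MB, and with 10 users at q = 0 plus 640 at q = 6 to roughly 6.2 MB, matching the figures in the text.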

Now consider addressing, storing and accessing the Γ-matrix data. For each pair (l,k),
k >= l, we have one complex Γ_lk[m] value for each value of m, where m ranges from m_min2
to m_max2, and the total number of non-zero elements is m_total = m_max2 - m_min2 + 1. Hence for
each pair (l,k), k >= l, we have 2 m_total time-contiguous bytes. To access the data, create an
array of structures:

    struct {
        int m_min2;
        int m_max2;
        int m_total;
        char * Glk;
    } G_info[N_VU_MAX][N_VU_MAX];

The C-matrix data is then retrieved using something like:

    m_min2 = G_info[l][k].m_min2
    m_max2 = G_info[l][k].m_max2
    Ng = Lg/Nc
    N1 = m'*N - Lg/(2Nc)
    for m' = 0:1
      for q = 0:L-1
        for q' = 0:L-1
          τ = m'T + τ_lq - τ_kq'
          m_min1 = N1 - n_lq + n_kq'
          m_max1 = m_min1 + Ng
          m_min = max[ m_min1, m_min2 ]
          m_max = min[ m_max1, m_max2 ]
          m_span = m_max - m_min + 1
          sum1 = 0.0;
          ptr1 = &G_info[l][k].Glk[m_min]
          ptr2 = &g[ m_min*Nc + τ ]
          while m_span > 0
            sum1 += ( *ptr1++ ) * ( *ptr2++ )
            m_span--
          end
          C[m'][l][k][q][q'] = sum1
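The inner accumulation of the retrieval loop can be sketched in C as below. Several details are assumptions rather than the original design: the `cplx` type, the function name `c_element`, storing Γ_lk[m_min2] at offset 0, a real-valued pulse table `g` indexed explicitly by m*Nc + tau with an origin offset `g_origin`, and treating samples outside the table as zero.

```c
#include <assert.h>

typedef struct { float re, im; } cplx;

/*
 * One C-matrix element: sum over m of g[m*Nc + tau] * Gamma_lk[m],
 * restricted to the overlap of [m_min1, m_max1] and [m_min2, m_max2].
 * gamma[0] is assumed to hold Gamma_lk[m_min2].
 */
cplx c_element(const cplx *gamma, int m_min2, int m_max2,
               const float *g, int g_origin, int g_len,
               int m_min1, int m_max1, int Nc, int tau)
{
    cplx sum = {0.0f, 0.0f};
    int m_min = (m_min1 > m_min2) ? m_min1 : m_min2;
    int m_max = (m_max1 < m_max2) ? m_max1 : m_max2;

    assert(Nc > 0 && g_len > 0);
    for (int m = m_min; m <= m_max; m++) {   /* empty when ranges do not overlap */
        int gi = g_origin + m * Nc + tau;
        if (gi < 0 || gi >= g_len)
            continue;                        /* pulse is zero outside its table */
        sum.re += g[gi] * gamma[m - m_min2].re;
        sum.im += g[gi] * gamma[m - m_min2].im;
    }
    return sum;
}
```

Clipping the lag range before the loop means no multiply is spent on lags where either the pulse window or the stored correlation is known to be zero.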
5. Estimated Processing Times

The following processing times are estimated below:

• Calculate Γ-matrix elements

• Write Γ-matrix elements to SDRAM

• Pack Γ-matrix elements in SDRAM

• Extract Γ-matrix elements/Form C-matrix from SDRAM

• Write C-matrix elements to L2 cache

• Pack C-matrix elements in L2 cache

Processing times are calculated for two cases of interest. The first case is where K = 100
users (K_v = 200 virtual users) are accessing the system and a voice user is added to the
system. Not all of these users are active. The control channels are always active, but the
data channels have activity factor AF = 0.4. The mean number of active virtual users is
then K + AF*K = 140. The standard deviation is σ = √(K·AF·(1 - AF)) = 4.90. With high
probability, then, we have K_v < 140 + 3σ < 155 active users.

The second case is the worst case scenario. This occurs when a number of voice users
are accessing the system and a single 384 Kbps data user is added. A single 384 Kbps
data user adds interference equal to (.25 + 0.125*100)/(.25 + 0.400*1) ~= 20 voice users.
Hence, the number of voice users accessing the system must be reduced to
approximately K = 100 - 20 = 80 (K_v = 160). The 3σ number of active virtual users is then
80 + (0.125)80 + 3(3.0) = 99 active virtual users. The reason this scenario is stressful is
that when a single 384 Kbps data user is added to the system, J + 1 = 64 + 1 = 65 virtual
users are added to the system.



Calculate Γ-matrix elements

The Γ-matrix elements can be calculated in one of two ways. The first is using the SAL
zconvx to perform the direct convolution. The second is using the SAL fft_zipx to perform
the calculation via the FFT. The first method is preferable when the vector lengths are
small. SAL timings are given in Table 2. These timings are based on a 400 MHz PPC7400
with a 160 MHz, 2 MB L2 cache. The data is assumed resident in L1 cache. The
performance loss for L2-cache-resident data is not severe.



Table 2. SAL timings and GFLOPS for zconvx function

M_total    N_l    Timing (µs)    GFLOPS
1024         4          19.33      1.70
1024         8          29.73      2.20
1024        16          50.55      2.59
1024        32          92.32      2.84
1024        64         176.53      2.97
1024       128         346.80      3.47


The time to perform a 512 complex FFT, with in-place calculation (fft_zipx), on a 400 MHz
PPC7400 with a 160 MHz, 2 MB L2 cache is 10.94 µs for data L1 resident. Prior to
performing the (final) FFT we must perform a complex vector multiply of length 512. The
SAL timings for zvmulx are given in Table 3.



Table 3. SAL timings and GFLOPS for zvmulx function

Length    Location    Timing (µs)    GFLOPS
1024      L1                 4.46     1.38
1024      L2                24.27     0.253
1024      DRAM              61.49     0.100


We will also be interested in the time to move data. Hence the SAL timings for zvmovx are
given in Table 4.

Table 4. SAL timings for zvmovx function

Length    Location    Timing (µs)
1024      L1                 1.20
1024      L2                15.34
1024      DRAM              30.05


Figure 2 shows the elements that must be calculated (in gray) when a physical user is
added to the system. When a physical user is added to the system there are 1 + J virtual
users added to the system: that is, 1 control channel + J = 256/SF data channels. The
number K_v represents the number of virtual users that are using the system to begin with.

Figure 2. Elements that must be calculated (in gray)
when a physical user is added to the system.

Hence there are (K_v + 1) elements added due to the control channel, and J(K_v + 1) + J(J +
1)/2 elements added due to the data channels. The total number of elements added is
then (J + 1)[K_v + 1 + J/2].
Suppose that the FFT is used to perform the calculations. The total number of FFTs to
perform is (J + 1) + (J + 1)[K_v + 1 + J/2]. The first term represents the FFTs to transform
c_k[n], and the second term represents the (J + 1)[K_v + 1 + J/2] inverse FFTs of
FFT{c_k[n]}*FFT{c_l[n]}. The time to perform the complex 512 FFT is 10.94 µs, whereas
the time to perform the complex vector multiply and the complex 512 FFT is 24.27/2 +
10.94 = 23.08 µs.

For the first scenario there are K_v = 200 virtual users accessing the system and a voice
user is added to the system (J = 1). The total time to add the voice user is then (1 +
1)(10.94 µs) + (1 + 1)[200 + 1 + 1/2](23.08 µs) = 9.3 ms.

For the second scenario there are K_v = 160 virtual users accessing the system and a 384
Kbps data user is added to the system (J = 64). The total time to add the 384 Kbps user is
then (64 + 1)(10.94 µs) + (64 + 1)[160 + 1 + 64/2](23.08 µs) = 290 ms. This is far too
long, and hence for high data-rate users, at least, the Γ-matrix elements must be
calculated via convolutions.
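The FFT-method estimate can be written as a small C function. This is only a sketch of the arithmetic in the preceding paragraphs: the function and macro names are hypothetical, and the two timing constants are the figures quoted in the text.

```c
#include <assert.h>

#define T_FFT_US      10.94   /* 512-point complex FFT, data L1 resident   */
#define T_MUL_FFT_US  23.08   /* vector multiply (24.27/2) plus one FFT    */

/* Time in microseconds to add one physical user with J data channels
   when K_v virtual users are already present, using the FFT method. */
double fft_add_user_time_us(int K_v, int J)
{
    assert(K_v >= 0 && J >= 1);
    double forward = (J + 1) * T_FFT_US;            /* transforms of the new codes */
    double pairs = (J + 1) * (K_v + 1 + J / 2.0);   /* new (l,k) elements          */
    return forward + pairs * T_MUL_FFT_US;
}
```

Calling it with (200, 1) and (160, 64) gives approximately 9.3 ms and 290 ms respectively, the two scenario totals above.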

The direct method to calculate the Γ-matrix elements is to use the SAL zconvx function to
perform the convolution

$$
\begin{aligned}
\Gamma_{lk}[m] &= \frac{1}{2N_l} \sum_{n=0}^{N_l-1} c_l^{*}[n + j_l N_l]\, c_k[n + j_l N_l - m] \\
&= \frac{1}{2N_l} \sum_{n=0}^{N_k-1} c_l^{*}[n + j_k N_k + m] \cdot c_k[n + j_k N_k]
\end{aligned}
\tag{18}
$$

For each value of m there are N_min = min{N_l, N_k} complex macs (cmacs). Each cmac
requires 8 flops, and there are m_total = N_l + N_k - 1 m-values to calculate. Hence the total
number of flops is 8 N_min (N_l + N_k - 1). For what follows we assume the convolution
calculation is performed at 1.50 GOPs = 1500 ops/µs. The calculation time to perform the
convolutions is presented in Table 5.
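The timing model just described is simple enough to express directly in C. The sketch below is illustrative, with hypothetical names; it assumes the 1500 ops/µs throughput stated above.

```c
#include <assert.h>

#define OPS_PER_US 1500.0   /* assumed throughput: 1.50 GOPs */

/* Direct-convolution time in microseconds for code lengths N_l and N_k. */
double conv_time_us(int N_l, int N_k)
{
    assert(N_l > 0 && N_k > 0);
    int N_min = (N_l < N_k) ? N_l : N_k;    /* cmacs per lag   */
    int m_total = N_l + N_k - 1;            /* number of lags  */
    double flops = 8.0 * N_min * m_total;   /* 8 flops per cmac */
    return flops / OPS_PER_US;
}
```

Because the cost grows with min(N_l, N_k) times the number of lags, pairs involving a short code are cheap even when the other code is long.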





Table 5. Calculation time (µs) for the direct convolution

            N_k = 256      128       64       32       16        8        4
N_l = 256      697.69   261.46   108.89    48.98    23.13    11.22     5.53
      128      261.46   174.08    65.19    27.14    12.20     5.76     2.79
       64      108.89    65.19    43.35    16.21     6.74     3.03     1.43
       32       48.98    27.14    16.21    10.75     4.01     1.66     0.75
       16       23.13    12.20     6.74     4.01     2.65     0.98     0.41
        8       11.22     5.76     3.03     1.66     0.98     0.64     0.23
        4        5.53     2.79     1.43     0.75     0.41     0.23     0.15

The shaded cells indicate times faster than the 23.08 µs FFT time. Equation 17 gives the
size of the Γ-matrix in bytes. Similarly, the total time to calculate the Γ-matrix is
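$$
\begin{aligned}
T_\Gamma &= \sum_{q=0}^{6} \left\{ \frac{1}{2} K_q (K_q + 1) T_{qq} + \sum_{q'=q+1}^{6} K_q K_{q'} T_{qq'} \right\} \\
&= \frac{1}{2} \sum_{q=0}^{6} \left\{ K_q T_{qq} + \sum_{q'=0}^{6} K_q K_{q'} T_{qq'} \right\} \\
&= \frac{1}{2} \left[ K \cdot \mathrm{diag}(T) + K^{T} T K \right]
\end{aligned}
\tag{19}
$$

where T_qq' are the elements in Table 5. Now suppose K' = K + Δ, where Δ_q = J_x δ_qx + J_y δ_qy,
where x and y are not equal. Then

$$
\begin{aligned}
\Delta T_\Gamma &= T_\Gamma(K') - T_\Gamma(K) \\
&= \frac{1}{2} J_x (J_x + 1) T_{xx} + \frac{1}{2} J_y (J_y + 1) T_{yy} + J_x J_y T_{xy} + \sum_{q=0}^{6} K_q \left\{ J_x T_{xq} + J_y T_{yq} \right\}
\end{aligned}
\tag{20}
$$
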



For the first scenario there are K_v = 200 virtual users accessing the system and a voice
user is added to the system (J = 1). Hence we have K_q = K_v δ_q0 (SF = 256), K_v = 200,
J_x = J + 1 = 2 (data plus control) and J_y = 0. The total time is then

½ J_x(J_x + 1) T_00 + J_x K_v T_00 = (0.5)(2)(3)(0.70 ms) + (2)(200)(0.70 ms) ≈ 282 ms

This is far too long, and hence for voice users, at least, the Γ-matrix elements
must be calculated via FFTs.

For the second scenario there are K_v = 160 virtual users accessing the system and a 384
Kbps data user is added to the system (J = 64). Hence we have K_q = K_v δ_q0 (SF = 256), K_v
= 160, J_x = 1 (control) and J_y = J = 64 (data). The total time is then

(K_v + 1) T_00 + J(K_v + 1) T_06 + (J + 1)(J/2) T_66

= (161)(697.7 µs) + (64)(161)(5.53 µs) + (65)(32)(0.15 µs)

= 112.33 ms + 56.98 ms + 0.31 ms = 169.62 ms

Since T_00 = 697.7 µs is so large, these calculations should be performed using the FFT,
which costs 23.08 µs per convolution. We also have 1 FFT to compute FFT{c_k*[n]} for
the single control channel. This costs an additional 10.94 µs. The total time, then, to add
the 384 Kbps user is

10.94 µs + (161)(23.08) µs + (64)(161)(5.53) µs + (65)(32)(0.15) µs
= 61.02 ms
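The incremental-time expression of Equation (20) can be checked numerically with a C sketch such as the one below. The names are hypothetical, and the per-pair times are computed from the flop-count model behind Table 5 rather than read from the rounded table entries, so results differ from the text in the last digits.

```c
#include <assert.h>

#define NUM_SF 7   /* spreading factors 256, 128, ..., 4 (q = 0..6) */

/* Per-pair direct-convolution time in microseconds (Table 5 model). */
static double pair_time_us(int q, int qp)
{
    int Nq = 1 << (8 - q), Nqp = 1 << (8 - qp);
    int Nmin = (Nq < Nqp) ? Nq : Nqp;
    return 8.0 * Nmin * (Nq + Nqp - 1) / 1500.0;
}

/* Extra direct-convolution time when J_x virtual users at index x and
   J_y virtual users at index y (x != y) join a population K. */
double delta_conv_time_us(const int K[NUM_SF], int x, int J_x, int y, int J_y)
{
    assert(x != y && J_x >= 0 && J_y >= 0);
    double t = 0.5 * J_x * (J_x + 1) * pair_time_us(x, x)
             + 0.5 * J_y * (J_y + 1) * pair_time_us(y, y)
             + (double)J_x * J_y * pair_time_us(x, y);
    for (int q = 0; q < NUM_SF; q++)
        t += K[q] * (J_x * pair_time_us(x, q) + J_y * pair_time_us(y, q));
    return t;
}
```

For the second scenario (160 users at q = 0, J_x = 1 at x = 0, J_y = 64 at y = 6) this gives a little under 170 ms, consistent with the all-convolution total above.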
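Write Γ-matrix elements to SDRAM

The numbers in Table 1 represent the 2 m_total bytes per Γ-matrix element. Recall that the
size of the Γ-matrix in bytes from Equation 17 is

$$
\begin{aligned}
M_{bytes} &= \frac{1}{2} \sum_{q=0}^{6} \left\{ K_q M_{qq} + \sum_{q'=0}^{6} K_q K_{q'} M_{qq'} \right\} \\
&= \frac{1}{2} \left[ K \cdot \mathrm{diag}(M) + K^{T} M K \right]
\end{aligned}
\tag{21}
$$

Now suppose K' = K + Δ, where Δ_q = J_x δ_qx + J_y δ_qy, where x and y are not equal. Then

$$
\begin{aligned}
\Delta M_{bytes} &= M_{bytes}(K') - M_{bytes}(K) \\
&= \frac{1}{2} J_x (J_x + 1) M_{xx} + \frac{1}{2} J_y (J_y + 1) M_{yy} + J_x J_y M_{xy} + \sum_{q=0}^{6} K_q \left\{ J_x M_{xq} + J_y M_{yq} \right\}
\end{aligned}
\tag{22}
$$
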

Consider the first scenario where K_q = 200 δ_q0 (SF = 256) and a single voice user is
added to the system: J_x = 2 (data plus control), and J_y = 0. The total number of bytes is
then 0.5(2)(3)(1022) + 200(2)(1022) = 0.412 MB. The SDRAM write speed is 133 MHz * 8
bytes * 0.5 = 532 MB/s. The time to write to SDRAM is then 0.774 ms.

Now for the second scenario K_q = 160 δ_q0 (SF = 256), and a single 384 Kbps (SF = 4)
user is added to the system: J_x = 1 (control) and J_y = 64 (data). The total number of bytes
is then 0.5(1)(2)(1022) + 0.5(64)(65)(14) + 160{1(1022) + 64(518)} = 5.498 MB. The
SDRAM write speed is 133 MHz * 8 bytes * 0.5 = 532 MB/s. The time to write to SDRAM is
then 10.33 ms.



Pack Γ-matrix elements in SDRAM

The maximum total size of the Γ-matrix is 20.5 MB. Suppose that in order to pack the
matrix every element must be moved. This is the worst case. The SDRAM speed is
133 MHz * 8 bytes * 0.5 = 532 MB/s. The move time is then 2(20.5 MB)/(532 MB/s) = 77.1
ms. If the Γ-matrix is divided over three processors this time is reduced by a factor of 3.
The packing can be done incrementally, so there is no strict time limit.
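The SDRAM write and pack estimates share one bandwidth model, which can be sketched in C as follows. The function names are hypothetical; the rate is the 133 MHz * 8 bytes * 0.5 figure used in the text.

```c
#include <assert.h>

/* Effective SDRAM rate from the text: 133 MHz * 8 bytes * 0.5 = 532 bytes/us. */
#define SDRAM_BYTES_PER_US (133.0 * 8.0 * 0.5)

/* Time in microseconds to write the given number of bytes to SDRAM. */
double sdram_write_time_us(double bytes)
{
    assert(bytes >= 0.0);
    return bytes / SDRAM_BYTES_PER_US;
}

/* Worst-case pack: every byte is read once and written once. */
double sdram_pack_time_us(double matrix_bytes)
{
    return 2.0 * sdram_write_time_us(matrix_bytes);
}
```

For the 411,866 bytes of the first scenario this gives about 0.77 ms, and packing a 20.5 MB matrix gives about 77 ms.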
Extract Γ-matrix elements/Form C-matrix from SDRAM

Recall that the C-matrix data is retrieved using something like:

    m_min2 = G_info[l][k].m_min2
    m_max2 = G_info[l][k].m_max2
    Ng = Lg/Nc
    N1 = m'*N - Lg/(2Nc)
    for m' = 0:1
      for q = 0:L-1
        for q' = 0:L-1
          τ = m'T + τ_lq - τ_kq'
          m_min1 = N1 - n_lq + n_kq'
          m_max1 = m_min1 + Ng
          m_min = max[ m_min1, m_min2 ]
          m_max = min[ m_max1, m_max2 ]
          if m_max >= m_min
            m_span = m_max - m_min + 1
            sum1 = 0.0;
            ptr1 = &G_info[l][k].Glk[m_min]
            ptr2 = &g[ m_min*Nc + τ ]
            while m_span > 0
              sum1 += ( *ptr1++ ) * ( *ptr2++ )
              m_span--
            end
            C[m'][l][k][q][q'] = sum1
          end



Time to extract elements when a new user is added to the system 

We calculated above the time to calculate the Γ-matrix elements when a new user is
added to the system. Here we consider the time to extract the corresponding C-matrix
elements.

Notice that Glk[m] are accessed from SDRAM. Values will almost certainly not be in either
L1 or L2 cache. For a given (l,k) pair, however, the spread in τ will for most cases be less
than 8 µs (i.e., for a 4 µs delay spread), which equates to (8 µs)(4 chips/µs)(2 bytes/chip) =
64 bytes, or 2 cache lines. Since data must be read in for two values of m' a total of 4
cache lines must be read. This will require 16 clocks, or about 16/133 = 0.12 µs.
However, measured results for zvmovx indicate that accesses to SDRAM are performed
at about 50% efficiency so that the required time is about 0.24 µs.

Now suppose, for example, user l = x is added to the system. We must fetch the elements
C[m'][x][k][q][q'] for all m', k, q and q'. As indicated above, all the m', q and q' values will
be contained typically in 4 cache lines. Hence if there are K_v virtual users we must read in
4K_v cache lines, or 32K_v clocks, where we have doubled the clocks to account for the 50%
efficiency. In general J + 1 virtual users are added to the system at a time. This will
require 32K_v(J + 1) clocks.

For the first case where we have 155 active virtual users and a new voice user is added
to the system, the time required to read in the C-matrix elements will be 32(155)(1 + 1)
clocks/(133 clocks/µs) = 74.6 µs. The industry standard hold time t_h for a voice call is 140
s. The average rate λ of users added to the system can be determined from λ t_h = K,
where K is the average number of users using the system. For K = 100 users we have λ =
100/140 s = 1 user added per 1.4 s.

For the case where we have 99 active virtual users and a 384 Kbps user is added to the
system, the time required to read in the C-matrix elements will be 32(99)(64 + 1)
clocks/(133 clocks/µs) = 1.55 ms. However, data users presumably will be added to the
system less frequently than voice users.



Time to extract elements when τ_lq changes

Now suppose, for example, user l = x lag q = y changes. Then we must fetch the elements
C[m'][x][k][y][q'] for all m', k and q'. All the q' values will be contained typically in 1 cache
line. Hence we must read in 2(K_v)(1) = 2K_v cache lines, or 16K_v clocks, where we have
doubled the clocks to account for the 50% efficiency. In general, when a lag changes
there are J + 1 virtual users for which the C-matrix elements must be updated. This will
require 16K_v(J + 1) clocks.

For the first case where we have 155 active virtual users and a voice user's profile (one
lag) changes, the time required to read in the C-matrix elements will be 16(155)(1 + 1)
clocks/(133 clocks/µs) = 37.3 µs. Recall that for high mobility users such changes should
occur at a rate of about 1 per 100 ms per physical user. This equates to about once per
1.33 ms processing interval if there are 100 physical users so that approximately 37.3 µs
will be required every 1.33 ms.

For the case where we have 99 virtual users and a 384 Kbps data user's profile (one lag)
changes, the time required to read in the C-matrix elements will be 16(99)(64 + 1)
clocks/(133 clocks/µs) = 0.774 ms. However, data users will have lower mobility and hence
such changes should occur infrequently.
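Both extraction estimates follow the same clock-count pattern, sketched below in C. The function name and the `clocks_per_pair` parameter are hypothetical conveniences: 32 corresponds to the new-user case and 16 to the single-lag case, each already doubled for the 50% SDRAM efficiency.

```c
#include <assert.h>

#define SDRAM_CLOCK_MHZ 133.0   /* clocks per microsecond */

/* Time in microseconds to fetch Gamma data for the C-matrix elements of
   J + 1 affected virtual users against K_v active virtual users.
   clocks_per_pair: 32 when a user is added, 16 when one lag changes. */
double c_fetch_time_us(int K_v, int J, int clocks_per_pair)
{
    assert(K_v >= 0 && J >= 1 && clocks_per_pair > 0);
    double clocks = (double)clocks_per_pair * K_v * (J + 1);
    return clocks / SDRAM_CLOCK_MHZ;
}
```

The four cases in the text correspond to the calls (155, 1, 32), (99, 64, 32), (155, 1, 16) and (99, 64, 16).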



Write C-matrix elements to L2 cache 

Time to write elements when a new user is added to the system 

Consider again the case where user l = x is added to the system. We must write elements
C[m'][x][k][q][q'] for all m', k, q and q'. If there are K_v active virtual users we must write
4K_v L² bytes, where we have doubled the bytes since the elements are complex. In general
J + 1 virtual users are added to the system at a time. This will require 4K_v L²(J + 1) bytes to
be written to L2 cache.

For the first case where we have 155 active virtual users and a new voice user is added
to the system, the time required to write the C-matrix elements will be 4(155)(16)(1 + 1)
bytes/(2128 bytes/µs) = 9.3 µs.

For the second case where we have 99 active virtual users and a 384 Kbps user is added
to the system, the time required to write the C-matrix elements will be 4(99)(16)(64 + 1)
bytes/(2128 bytes/µs) = 193.5 µs. Recall, however, that data users presumably will be
added to the system less frequently than voice users.



Time to write elements when τ_lq changes

Now suppose, for example, user l = x lag q = y changes. We must write elements
C[m'][x][k][y][q'] for all m', k and q'. If there are K_v active virtual users we must write 4K_v L
bytes, where we have doubled the bytes since the elements are complex. In general J + 1
virtual users are added to the system at a time. This will require 4K_v L(J + 1) bytes to be
written to L2 cache.

For the first case where we have 155 active virtual users and a voice user's profile (one
lag) changes, the time required to write the C-matrix elements will be 4(155)(4)(1 + 1)
bytes/(2128 bytes/µs) = 2.33 µs.

For the second case where we have 99 active virtual users and a 384 Kbps data user's
profile (one lag) changes, the time required to write the C-matrix elements will be
4(99)(4)(64 + 1) bytes/(2128 bytes/µs) = 48.4 µs. However, data users will have lower
mobility and hence such changes should occur infrequently.

Pack C-matrix elements in L2 cache

The C-matrix elements will need to be packed in memory every time a new user is added
to or deleted from the system and every time a new user becomes active or inactive. The
size of the C-matrix is 2(3/2)(K_v L)² = 3(K_v L)² bytes; divided over three
processors, however, this becomes (K_v L)² bytes per processor. Assume that the entire matrix must
be moved. The move is within L2 cache. Hence the total move time is 2(K_v L)² bytes/(2128
bytes/µs), where the factor of 2 accounts for read and write.

For the first case where we have 155 active virtual users the time required to move the
C-matrix elements will be 2(155*4)² bytes/(2128 bytes/µs) = 0.361 ms.

For the second case where we have 99 active virtual users the time required to move the
C-matrix elements will be 2(99*4)² bytes/(2128 bytes/µs) = 0.147 ms.

These events will occur typically once every 10 ms, that is, once per frame.
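The L2 write and pack estimates can be gathered into two small C functions. The names are hypothetical, the 2128 bytes/µs rate is the figure used in the text, and `entries` stands for L² (new user) or L (single lag change) with L = 4 as in the worked numbers.

```c
#include <assert.h>

#define L2_BYTES_PER_US 2128.0   /* L2 cache rate used in the text */

/* Time in microseconds to write C-matrix elements for J + 1 virtual users:
   4 * K_v * entries bytes per affected virtual user. */
double l2_write_time_us(int K_v, int entries, int J)
{
    assert(K_v >= 0 && entries > 0 && J >= 1);
    return 4.0 * K_v * entries * (J + 1) / L2_BYTES_PER_US;
}

/* Per-processor pack time: (K_v * L)^2 bytes, each read once and written once. */
double l2_pack_time_us(int K_v, int L)
{
    assert(K_v >= 0 && L > 0);
    double n = (double)K_v * L;
    return 2.0 * n * n / L2_BYTES_PER_US;
}
```

Note that the pack time grows with the square of the active-user count, which is why the 155-user case costs more than the 99-user case even though the latter involves the high-rate user.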
6. Summary and Conclusions

In summary, we have determined

• The Γ-matrix will require approximately 20.5 MB of SDRAM

• To efficiently calculate the Γ-matrix elements will require both direct convolution and
FFT calculations

• To pack the Γ matrix in SDRAM will require approximately 77.1 ms

The following processing times are estimated:

Estimated Processing Times               Case 1                Case 2
                                         (voice user added)    (384 Kbps user added)
Calculate Γ-matrix elements              9.3 ms                61.0 ms
Write Γ-matrix elements to SDRAM         0.77 ms               10.3 ms
Extract C-matrix elements when
    New user added                       75 µs                 1.6 ms
    Multipath profile changes            37 µs                 0.77 ms
Write C-matrix elements to L2 when
    New user added                       9.3 µs                194 µs
    Multipath profile changes            2.3 µs                48 µs
Pack C-matrix elements in L2 cache       361 µs                147 µs

These times are based on a single, dedicated G4 allocated to perform the calculations.
The C-matrix elements can be represented in terms of the underlying code
correlations using

$$
\begin{aligned}
C_{lkqq'}[m'] &\equiv \frac{1}{2N_l} \sum_n \sum_p g[(n-p)N_c + m'T + \hat{\tau}_{lq} - \hat{\tau}_{kq'}] \cdot c_k[p]\, c_l^{*}[n] \\
&= \frac{1}{2N_l} \sum_m \sum_n g[mN_c + \tau]\, c_k[n-m]\, c_l^{*}[n] \\
&= \sum_m g[mN_c + \tau] \cdot \frac{1}{2N_l} \sum_n c_l^{*}[n] \cdot c_k[n-m] \\
&= \sum_m g[mN_c + \tau] \cdot \Gamma_{lk}[m]
\end{aligned}
\tag{1}
$$

where τ = m'T + τ̂_lq - τ̂_kq'.

The Γ-matrix represents the correlation between the complex user codes. The
complex code for user l is assumed to be infinite in length, but with only N_l non-
zero values. The non-zero values are constrained to be ±1 ± j. The Γ-matrix
represented in terms of the real and imaginary parts of the complex user codes
becomes
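$$
\begin{aligned}
\Gamma_{lk}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^{*}[n] \cdot c_k[n-m] \\
&= \frac{1}{2N_l} \sum_n \left\{ c_l^{R}[n] \cdot c_k^{R}[n-m] + c_l^{I}[n] \cdot c_k^{I}[n-m] + j\,c_l^{R}[n] \cdot c_k^{I}[n-m] - j\,c_l^{I}[n] \cdot c_k^{R}[n-m] \right\} \\
&= \Gamma_{lk}^{RR}[m] + \Gamma_{lk}^{II}[m] + j\left\{ \Gamma_{lk}^{RI}[m] - \Gamma_{lk}^{IR}[m] \right\}
\end{aligned}
\tag{3}
$$

$$
\begin{aligned}
\Gamma_{lk}^{RR}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^{R}[n] \cdot c_k^{R}[n-m], &
\Gamma_{lk}^{II}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^{I}[n] \cdot c_k^{I}[n-m], \\
\Gamma_{lk}^{RI}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^{R}[n] \cdot c_k^{I}[n-m], &
\Gamma_{lk}^{IR}[m] &\equiv \frac{1}{2N_l} \sum_n c_l^{I}[n] \cdot c_k^{R}[n-m]
\end{aligned}
$$

Consider any one of the above real correlations, denoted

$$\Gamma_{lk}^{XY}[m] \equiv \frac{1}{2N_l} \sum_n c_l^{X}[n] \cdot c_k^{Y}[n-m] \tag{4}$$

where X and Y can be either R or I. Since the elements of the codes are now
constrained to be ±1 or 0, we can define

$$c_l^{X}[n] \equiv \left(1 - 2\gamma_l^{X}[n]\right) \cdot m_l^{X}[n] \tag{5}$$

where γ_l^X[n] and m_l^X[n] are both either zero or one. The sequence m_l^X[n] is a
mask used to account for values of c_l^X[n] that are zero. With these definitions
Equation (4) becomes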
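$$
\begin{aligned}
\Gamma_{lk}^{XY}[m] &\equiv \frac{1}{2N_l} \sum_n \left(1 - 2\gamma_l^{X}[n]\right) \cdot m_l^{X}[n] \cdot \left(1 - 2\gamma_k^{Y}[n-m]\right) \cdot m_k^{Y}[n-m] \\
&= \frac{1}{2N_l} \sum_n \left(1 - 2\gamma_l^{X}[n]\right) \cdot \left(1 - 2\gamma_k^{Y}[n-m]\right) \cdot m_l^{X}[n] \cdot m_k^{Y}[n-m] \\
&= \frac{1}{2N_l} \sum_n \left\{ 1 - 2\left(\gamma_l^{X}[n] \oplus \gamma_k^{Y}[n-m]\right) \right\} \cdot m_l^{X}[n] \cdot m_k^{Y}[n-m] \\
&= \frac{1}{2N_l} \left\{ M_{lk}^{XY}[m] - 2 N_{lk}^{XY}[m] \right\} \\
M_{lk}^{XY}[m] &\equiv \sum_n m_l^{X}[n] \cdot m_k^{Y}[n-m] \\
N_{lk}^{XY}[m] &\equiv \sum_n \left(\gamma_l^{X}[n] \oplus \gamma_k^{Y}[n-m]\right) \cdot m_l^{X}[n] \cdot m_k^{Y}[n-m]
\end{aligned}
\tag{6}
$$

where ⊕ indicates modulo-2 addition (or logical XOR).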
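The identity behind Equation (6) can be checked in software with a C sketch like the one below. It is only an illustration: the function names are hypothetical, each bit is held in its own `unsigned char` for clarity rather than packed into registers, and the result is the unnormalized sum (multiply by 1/(2N_l) to obtain the correlation).

```c
#include <assert.h>

/* Direct evaluation of sum_n c_l[n] * c_k[n - m], code values in {-1, 0, +1}. */
int corr_direct(const int *c_l, const int *c_k, int len, int m)
{
    int sum = 0;

    for (int n = 0; n < len; n++) {
        int j = n - m;
        if (j >= 0 && j < len)
            sum += c_l[n] * c_k[j];
    }
    return sum;
}

/* Same sum from sign bits (gamma: 1 means -1) and masks (mask: 1 means
   non-zero), computed as M - 2N using only AND, XOR and counting. */
int corr_boolean(const unsigned char *gamma_l, const unsigned char *mask_l,
                 const unsigned char *gamma_k, const unsigned char *mask_k,
                 int len, int m)
{
    int M = 0, N = 0;

    assert(len > 0);
    for (int n = 0; n < len; n++) {
        int j = n - m;
        if (j < 0 || j >= len)
            continue;
        int both = mask_l[n] & mask_k[j];          /* both chips non-zero   */
        M += both;
        N += (gamma_l[n] ^ gamma_k[j]) & both;     /* chips of opposite sign */
    }
    return M - 2 * N;
}
```

Because only AND, XOR and a population count are needed per lag, the computation maps naturally onto shift registers and simple logic.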

The hardware to perform these operations is shown in Figures 1 - 3. Figure 1 
shows the initial register configuration after loading code and mask sequences. 
The boolean functions are shown in Figure 2, and Figure 3 shows the register 
configuration after a number of shifts. 



[Figure 1: two sets of 256-chip registers, one loaded with the mask and code for user l
and one loaded with the mask and code for user k, feeding a block of Boolean operations
that produces the output.]

Figure 1. Initial register configuration after loading code and mask sequences.
Figure 2. Boolean functions.

[Figure 3: zeros are loaded from the left; a total of 512 shifts is performed, shifting
mask k and code k out of the registers at right, with the Boolean operations producing
the output.]

Figure 3. Register configuration after a number of shifts.

The above hardware calculates the functions M_lk^XY[m] and N_lk^XY[m]. The
remaining calculations to form Γ_lk^XY[m] and subsequently Γ_lk[m] can be
performed in software. Note that the four functions Γ_lk^XY[m] corresponding to X, Y =
R, I, which are components of Γ_lk[m], can be calculated in parallel. For K_v = 200
virtual users, and assuming that 10% of all (l, k) pairs must be calculated in 2 ms,
then for real-time operation we must calculate 0.10(200)² = 4000 Γ_lk[m] elements
(all shifts) in 2 ms, or about 2M elements (all shifts) per second. For K_v = 128
virtual users the requirement drops to 0.8192M elements (all shifts) per second.

In what has been presented the Γ_lk^XY[m] elements are calculated for all 512 shifts.
Not all of these shifts are needed, so it is possible to reduce the number of
calculations per Γ_lk[m] element. The cost is increased design complexity.
.a .c .mac .o

###CFLAGS = -Ot -t ${ARCH} -I. -DCOMPILE_C
CFLAGS = -Ot -t ${ARCH} -I.
ASFLAGS = -t ${ARCH} -DBUILD_MAX -I.

# 

# Make object files 
# 

.c.o:



ccmc ${CFLAGS} 
Make ASM 



	rm -f $*.S
	cp $*.mac $*.S
	ccmc ${ASFLAGS} -o $*.o -c $*.S
	rm -f $*.S



OBJS = \
	get_sizes.o \
	get_sizes_v.o \
	reformat_corr.o \
	rmats.o \
	reformat_r.o \
	mpic.o \
	gen_x_row.o \
	gen_r_sums.o \
	gen_r_sums2.o \
	gen_r_matrices.o \
	mtrans32_8bit.o \
	mtriangle_8bit.o \
	dotpr3_8bit.o \
	dotpr6_8bit.o \
	dotpr9_8bit.o \
	sve3_8bit.o \
	fixed_cdotpr.o \
	zdotpr4_vmx.o \
	zdotpr_vmx.o

${MUDLIB}: Makefile ${OBJS}
	armc -c $@ ${OBJS}

#
# Cleanup
#

clean:
	rm -f ${OBJS} *.S ${MUDLIB}

get_sizes.o: mudlib.h get_sizes.c
reformat_corr.o: mudlib.h reformat_corr.c
rmats.o: mudlib.h rmats.c \
	gen_x_row.mac gen_r_sums.mac gen_r_sums2.mac \
	gen_r_matrices.mac
reformat_r.o: mudlib.h reformat_r.c
mpic.o: mudlib.h mpic.c \
	dotpr3_8bit.mac dotpr6_8bit.mac dotpr9_8bit.mac \
	sve3_8bit.mac

dotpr3_8bit.o: dotpr3_8bit.mac salppc.inc
dotpr6_8bit.o: dotpr6_8bit.mac salppc.inc
dotpr9_8bit.o: dotpr9_8bit.mac salppc.inc
sve3_8bit.o: sve3_8bit.mac salppc.inc
fixed_cdotpr.o: zdotpr4_vmx.mac salppc.inc
zdotpr4_vmx.o: zdotpr4_vmx.mac zdotpr4_vmx.k salppc.inc
#include "mudlib.h"



DO CALC STATS 
DO TRUNCATE 1 
DO SATURATE 1 
DO_SQUELCH 0 



#def ine 
#def ine 
#define 
#def ine 

#def ine 

#define 



#if DO_TRUNCATE
#define SATURATE_THRESH (128.0 - TRUNCATE_BIAS)
#else
#define SATURATE_THRESH 127.5
#endif

#define SATURATE( f ) \
    if ( (f) >= SATURATE_THRESH ) f = (SATURATE_THRESH - 1.0); \
    else if ( (f) < -SATURATE_THRESH ) f = -SATURATE_THRESH;




#define BF8_FIX ( f ) 



#def ine 
#else 
#def ine 
\ 

#endif 

#else 
#define 
#endif 



BF8_FIX( f ) 
BF8_FIX( f ) 



( (BF8) (FABS(f) <= TRUNCATE BIAS) ? 0 : \ 
(((f) > 0.0) ? ((f) - TRUNCATE BIAS) : \ 
((f) + TRUNCATE_BIAS) ) ) 

( (BF8) (f) ) 



( (BF8) (((((f) < 0.0) ) && ( (f ) 
((f) +1.0) : (f) ) ) 



(float) ( (int) (f) ) ) ) ? 



BF8_FIX( f ) ( (BF8) ( ( (f) 

#define UPDATE_MAX( f, max ) \
    if ( FABS( f ) > max ) max = FABS( f );

#define uchar unsigned char
#define ushort unsigned short
#define ulong unsigned long

#if DO_CALC_STATS
static float max_R_value;
#endif

void gen_X_row (
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
);

void gen_R_sums (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users



0.0) ? ((f)+0.5) 



) ; 

void gen_R_sums2 (
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
);

void gen_R_matrices (
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
);

mudlib gen R ( 
COMPLEX BF16 
COMPLEX BF16 
COMPLEX BF8 
COMPLEX BF8 
uchar 
float 
float 
float 
char 



int 
int 
int 



*mpathl bf, 
*mpath2 bf, 
*corr 0 bf, 
*corr_l_bf , 
*ptov map, 
*bf scalep, 
*inv scalep, 
*scalep, 
*L1 cachep, 
*R0 upper bf , 
*R0 lower bf , 
*R1 trans_bf, 
*Rlm bf, 
tot phys users, 
tot virt users, 
start phys user, 
start virt user, 
end phys user, 
end virt user 



/* adjusted for starting physical user */ 

/* adjusted for starting physical user */ 

/* no more than 256 virts. per phys */ 

/* scalar: always a power of 2 */ 

/* start at 0 ' th physical user */ 

/* start at 0 1 th physical user */ 

/* temp: 32K bytes, 32-byte aligned */ 



/* zero-based starting row (inclusive) * 
/* relative to start phys user */ 
/* zero-based ending row (inclusive) */ 
/* relative to end_phys_user */ 



    COMPLEX_BF16 *X_bf;
    BF32 *R_sums0, *R_sums1;
    uchar *R0_ptov_map;
    int bump, byte_offset, i, iv, last_virt_user;
    int R0_align, R0_skipped_virt_users, R0_tcols, R0_virt_users, R1_tcols;

#if DO_CALC_STATS
    max_R_value = 0.0;
#endif

    X_bf = (COMPLEX_BF16 *)L1_cachep;

    byte_offset = tot_phys_users * NUM_FINGERS_SQUARED * sizeof(COMPLEX_BF16);
    R_sums0 = (BF32 *)(((ulong)X_bf + byte_offset + R_MATRIX_ALIGN_MASK) &
                       ~R_MATRIX_ALIGN_MASK);

    byte_offset = tot_virt_users * sizeof(BF32);
    R_sums1 = (BF32 *)(((ulong)R_sums0 + byte_offset + R_MATRIX_ALIGN_MASK) &
                       ~R_MATRIX_ALIGN_MASK);



irt users 



R MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK; 



2 
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RO vi: 

    for ( i = start_phys_user; i < tot_phys_users; i++ ) {
        R0_virt_users += (int)ptov_map[i];
        R0_ptov_map[i] = ptov_map[i];
    }
    R0_ptov_map[start_phys_user] -= start_virt_user;

    R0_skipped_virt_users = tot_virt_users - R0_virt_users + start_virt_user;
    R0_virt_users -= (start_virt_user + 1);

    --inv_scalep;    /* predecrement to allow for common indexing */

    for ( i = start_phys_user; i <= end_phys_user; i++ ) {

        gen_X_row (
            mpath1_bf,
            mpath2_bf,
            X_bf,
            i,
            tot_phys_users
        );

        --R0_ptov_map[i];    /* excludes R0 diagonal */

        last_virt_user = (i < end_phys_user) ? ((int)ptov_map[i] - 1) :
                                               end_virt_user;

        for ( iv = start_virt_user; (iv + 1) <= last_virt_user; iv += 2 ) {

            gen_R_sums2 (
                X_bf + (i * NUM_FINGERS_SQUARED),
                corr_0_bf,
                corr_0_bf + ((R0_virt_users - 1) * NUM_FINGERS_SQUARED),
                R0_ptov_map + i,
                R_sums0 + (R0_skipped_virt_users + 1),
                R_sums1 + (R0_skipped_virt_users + 1),
                tot_phys_users - i
            );

            R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices (
                R_sums0 + (R0_skipped_virt_users + 1),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep + (R0_skipped_virt_users + 1),
                R0_lower_bf + R0_align,
                R0_upper_bf + R0_align,
                R0_virt_users
            );

            R0_upper_bf[R0_align - 1] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            R0_tcols = R1_tcols - ((R0_skipped_virt_users + 1) &
                                   ~R_MATRIX_ALIGN_MASK);
            R0_align = ((R0_skipped_virt_users + 1) & R_MATRIX_ALIGN_MASK) + 1;

            gen_R_matrices (
                R_sums1 + (R0_skipped_virt_users + 2),
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 2),
                scalep + (R0_skipped_virt_users + 2),
                R0_lower_bf + R0_align,




rmats . c 



            R0_upper_bf[R0_align - 1] = 0;    /* zero diagonal element */

            R0_lower_bf += R0_tcols;
            R0_upper_bf += R0_tcols;

            /*
             * create ptov_map[i] number of 32-element dot products involving
             * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
             */
            gen_R_sums2 (
                X_bf,
                corr_1_bf,
                corr_1_bf + (tot_virt_users * NUM_FINGERS_SQUARED),
                ptov_map,
                R_sums0,
                R_sums1,
                tot_phys_users
            );

            /*
             * scale the results and create two output rows (1 per matrix)
             */
            gen_R_matrices (
                R_sums0,
                bf_scalep,
                inv_scalep + (R0_skipped_virt_users + 1),
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );

            gen_R_matrices (
                R_sums1,
                bf_scalep,
                inv_scalep + (R0_skipped_
                scalep,
                R1_trans_bf,
                R1m_bf,
                tot_virt_users
            );



Rl to 



            corr_0_bf += (((2 * R0_virt_users) - 1) * NUM_FINGERS_SQUARED);
            corr_1_bf += ((2 * tot_virt_users) * NUM_FINGERS_SQUARED);
            R0_ptov_map[i] -= 2;
            R0_virt_users -= 2;
            R0_skipped_virt_users += 2;
        }

if ( iv <= last_virt_user ) { 

bump = RO ptov_map [ i ] ? 0 : 1 ; 
gen R sums ( 

X bf + ( (i + bump) * NUM_FINGERS_SQUARED) , 

corr 0 bf, 

RO ptov_map + i + bump, 

R_sums0 + (RO_skipped_virt_users + 1) , 
R0_tcols = R1_tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
R0_align = (R0_skipped_virt_users & R_MATRIX_ALIGN_MASK) + 1;

gen_R_matrices(
    R_sums0 + (R0_skipped_virt_users + 1),
    bf_scalep,
    inv_scalep + (R0_skipped_virt_users + 1),
    scalep + (R0_skipped_virt_users + 1),
    R0_lower_bf + R0_align,
    R0_upper_bf + R0_align,
    R0_virt_users
);

R0_upper_bf[ R0_align - 1 ] = 0;    /* zero diagonal element */

R0_lower_bf += R0_tcols;
R0_upper_bf += R0_tcols;

/*
 * create ptov_map[i] number of 32-element dot products involving
 * X_bf[i] and corr_1_bf[i][j] where 0 < j < ptov_map[i]
 */

gen_R_sums(
    X_bf,
    corr_1_bf,
    ptov_map,
    R_sums0,
    tot_phys_users
);

/*
 * scale the results and create two output rows (1 per matrix)
 */

gen_R_matrices(
    R_sums0,
    bf_scalep,
    inv_scalep + (R0_skipped_virt_users + 1),
    scalep,
    R1_trans_bf,
    R1m_bf,
    tot_virt_users
);

R1m_bf += R1_tcols;

corr_0_bf += (R0_virt_users * NUM_FINGERS_SQUARED);
corr_1_bf += (tot_virt_users * NUM_FINGERS_SQUARED);
R0_ptov_map[i] -= 1;
R0_virt_users -= 1;
R0_skipped_virt_users += 1;

start_virt_user = 0;    /* for all subsequent passes */

}

#if DO_CALC_STATS

printf( "max R value = %f\n", max_R_value );

if ( max_R_value > 127.0 )
    printf( "***** OVERFLOW *****\n" );
#endif
}
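
The row bookkeeping above splits the skipped-user count into an aligned part, which shortens the row, and a residual offset into the row. The following is a minimal sketch of that arithmetic only. The value of R_MATRIX_ALIGN_MASK is not shown in the listing, so 15 is an assumption, and row_tcols and row_align are hypothetical helper names.

```c
/* Assumed value: the listing does not show R_MATRIX_ALIGN_MASK. */
#define R_MATRIX_ALIGN_MASK 15

/* Hypothetical helper: columns remaining in a row once the aligned part
 * of the skipped-user count has been dropped. */
static int row_tcols(int full_tcols, int skipped)
{
    return full_tcols - (skipped & ~R_MATRIX_ALIGN_MASK);
}

/* Hypothetical helper: offset of the first stored element within the row,
 * one past the unaligned residue of the skipped-user count. */
static int row_align(int skipped)
{
    return (skipped & R_MATRIX_ALIGN_MASK) + 1;
}
```

Under the assumed mask, a skipped count of 17 removes 16 columns from the row and leaves an offset of 2, while a count of 14 removes none and leaves an offset of 15.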

#if COMPILE_C 



void gen_X_row(
    COMPLEX_BF16 *mpath1_bf,
    COMPLEX_BF16 *mpath2_bf,
    COMPLEX_BF16 *X_bf,
    int phys_index,
    int tot_phys_users
)



{ 



    COMPLEX_BF16 *in_mpath1p, *in_mpath2p;
    COMPLEX_BF16 *out_mpath1p, *out_mpath2p;
    int i, j, q, q1;
    BF32 s1r, s1i, s2r, s2i;
    BF32 a1r, a1i, a2r, a2i;
    BF32 cr, ci;

    out_mpath1p = mpath1_bf + (phys_index * NUM_FINGERS);
    out_mpath2p = mpath2_bf + (phys_index * NUM_FINGERS);

    for ( i = 0; i < tot_phys_users; i++ ) {

        in_mpath1p = mpath1_bf + (i * NUM_FINGERS);    /* 4 complex values */
        in_mpath2p = mpath2_bf + (i * NUM_FINGERS);    /* 4 complex values */

        j = 0;

        for ( q1 = 0; q1 < NUM_FINGERS; q1++ ) {

            s1r = (BF32)out_mpath1p[q1].real;
            s1i = (BF32)out_mpath1p[q1].imag;
            s2r = (BF32)out_mpath2p[q1].real;
            s2i = (BF32)out_mpath2p[q1].imag;

            for ( q = 0; q < NUM_FINGERS; q++ ) {

                a1r = (BF32)in_mpath1p[q].real;
                a1i = (BF32)in_mpath1p[q].imag;
                a2r = (BF32)in_mpath2p[q].real;
                a2i = (BF32)in_mpath2p[q].imag;

                cr = (a1r * s1r) + (a1i * s1i);
                ci = (a1r * s1i) - (a1i * s1r);
                cr += (a2r * s2r) + (a2i * s2i);
                ci += (a2r * s2i) - (a2i * s2r);

                X_bf[i * NUM_FINGERS_SQUARED + j].real = (BF16)(cr >> 16);
                X_bf[i * NUM_FINGERS_SQUARED + j].imag = (BF16)(ci >> 16);
                j++;    /* next entry in this user's NUM_FINGERS_SQUARED block */
            }
        }
    }
}
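
The inner statements of gen_X_row form one entry of the X row: the conjugate of each incoming path value multiplied by the reference path value, summed over the two path sets, with the upper 16 bits of each 32-bit accumulator kept. A minimal sketch of that single entry follows. The typedefs are stand-ins assumed for illustration, since the listing uses BF16, BF32 and COMPLEX_BF16 without showing their definitions, and cross_term is a hypothetical name.

```c
#include <stdint.h>

/* Assumed stand-ins for types the listing uses but does not define here. */
typedef int16_t BF16;
typedef int32_t BF32;
typedef struct { BF16 real; BF16 imag; } COMPLEX_BF16;

/* conj(a1)*s1 + conj(a2)*s2, keeping the upper 16 bits of each
 * 32-bit accumulator (hypothetical helper, not part of the listing). */
static COMPLEX_BF16 cross_term(COMPLEX_BF16 a1, COMPLEX_BF16 s1,
                               COMPLEX_BF16 a2, COMPLEX_BF16 s2)
{
    BF32 cr, ci;
    COMPLEX_BF16 out;

    cr  = ((BF32)a1.real * s1.real) + ((BF32)a1.imag * s1.imag);
    ci  = ((BF32)a1.real * s1.imag) - ((BF32)a1.imag * s1.real);
    cr += ((BF32)a2.real * s2.real) + ((BF32)a2.imag * s2.imag);
    ci += ((BF32)a2.real * s2.imag) - ((BF32)a2.imag * s2.real);

    out.real = (BF16)(cr >> 16);
    out.imag = (BF16)(ci >> 16);
    return out;
}
```

The sketch relies on an arithmetic right shift of negative values, which is what common compilers provide but which the C standard leaves implementation-defined.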



void gen_R_sums(
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corr_bf,
    uchar *ptov_map,
    BF32 *R_sums,
    int num_phys_users
)
{
    int i, j, k;
    BF32 sum;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            sum = 0;
            for ( k = 0; k < 16; k++ ) {
                sum += (BF32)X_bf[k].real * (BF32)corr_bf->real;
                sum += (BF32)X_bf[k].imag * (BF32)corr_bf->imag;
                ++corr_bf;
            }
            *R_sums++ = sum;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}
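
Each value written by gen_R_sums is a 16-term sum of real-by-real and imaginary-by-imaginary products, which is the "32-element dot product" mentioned in the calling code. A minimal sketch of one such sum is given below. The typedefs and the value 16 for NUM_FINGERS_SQUARED are assumptions (the listing indicates 4 complex values per user and loops k over 16 entries), and r_sum is a hypothetical name.

```c
#include <stdint.h>

/* Assumed stand-ins for types the listing uses but does not define here. */
typedef int8_t  BF8;
typedef int16_t BF16;
typedef int32_t BF32;
typedef struct { BF16 real; BF16 imag; } COMPLEX_BF16;
typedef struct { BF8 real; BF8 imag; } COMPLEX_BF8;

/* Assumed: 4 fingers, hence 16 finger pairs. */
#define NUM_FINGERS_SQUARED 16

/* One output of the sum loop: 16 real products plus 16 imaginary
 * products accumulated in 32 bits (hypothetical helper). */
static BF32 r_sum(const COMPLEX_BF16 *x, const COMPLEX_BF8 *corr)
{
    BF32 sum = 0;
    int k;

    for (k = 0; k < NUM_FINGERS_SQUARED; k++) {
        sum += (BF32)x[k].real * (BF32)corr[k].real;
        sum += (BF32)x[k].imag * (BF32)corr[k].imag;
    }
    return sum;
}
```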

void gen_R_sums2(
    COMPLEX_BF16 *X_bf,
    COMPLEX_BF8 *corra_bf,
    COMPLEX_BF8 *corrb_bf,
    uchar *ptov_map,
    BF32 *R_sumsa,
    BF32 *R_sumsb,
    int num_phys_users
)
{
    int i, j, k;
    BF32 suma, sumb;

    for ( i = 0; i < num_phys_users; i++ ) {
        for ( j = 0; j < (int)ptov_map[i]; j++ ) {
            suma = 0;
            sumb = 0;
            for ( k = 0; k < 16; k++ ) {
                suma += (BF32)X_bf[k].real * (BF32)corra_bf->real;
                suma += (BF32)X_bf[k].imag * (BF32)corra_bf->imag;
                sumb += (BF32)X_bf[k].real * (BF32)corrb_bf->real;
                sumb += (BF32)X_bf[k].imag * (BF32)corrb_bf->imag;
                ++corra_bf;
                ++corrb_bf;
            }
            *R_sumsa++ = suma;
            *R_sumsb++ = sumb;
        }
        X_bf += NUM_FINGERS_SQUARED;
    }
}



void gen_R_matrices(
    BF32 *R_sums,
    float *bf_scalep,
    float *inv_scalep,
    float *scalep,
    BF8 *no_scale_row_bf,
    BF8 *scale_row_bf,
    int num_virt_users
)
{
    int i;
    float bf_scale, fsum, fsum_scale, inv_scale, scale;

    bf_scale = *bf_scalep;
    inv_scale = *inv_scalep;

    for ( i = 0; i < num_virt_users; i++ ) {
        scale = scalep[i];
        fsum = (float)(R_sums[i]);
        fsum *= bf_scale;
        fsum_scale = fsum * inv_scale;
        fsum_scale *= scale;

#if DO_CALC_STATS
        UPDATE_MAX( fsum_scale, max_R_value )
        UPDATE_MAX( fsum, max_R_value )
#endif

#if DO_SQUELCH
        if ( FABS( fsum_scale ) <= SQUELCH_THRESH ) fsum_scale = 0.0;
        if ( FABS( fsum ) <= SQUELCH_THRESH ) fsum = 0.0;
#endif

#if DO_SATURATE
        SATURATE( fsum_scale )
        SATURATE( fsum )
#endif

        no_scale_row_bf[i] = BF8_FIX( fsum );
        scale_row_bf[i] = BF8_FIX( fsum_scale );
    }
}

#endif /* COMPILE_C */ 
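
gen_R_matrices turns each 32-bit sum into two 8-bit outputs: one scaled by bf_scale alone and one additionally scaled by inv_scale and the per-user scale, each passed through optional squelch and saturation stages. The sketch below illustrates that pipeline for a single entry. SQUELCH_THRESH, SATURATE and BF8_FIX are not defined in the listing, so the threshold, the [-128, 127] range and round-to-nearest conversion used here are assumptions, and to_bf8 and scale_entry are hypothetical names.

```c
#include <stdint.h>

typedef int8_t BF8;              /* assumed 8-bit fixed-point type */

#define SQUELCH_THRESH 0.5f      /* assumed threshold */
#define BF8_MAX  127.0f          /* matches the "> 127.0" overflow check */
#define BF8_MIN -128.0f          /* assumed lower bound */

/* Squelch, saturate, then round to the nearest 8-bit value. */
static BF8 to_bf8(float v)
{
    float mag = (v < 0.0f) ? -v : v;

    if (mag <= SQUELCH_THRESH) v = 0.0f;    /* squelch small values */
    if (v > BF8_MAX) v = BF8_MAX;           /* saturate high */
    if (v < BF8_MIN) v = BF8_MIN;           /* saturate low */
    return (BF8)((v >= 0.0f) ? (v + 0.5f) : (v - 0.5f));
}

/* One matrix entry: the unscaled row gets sum * bf_scale, the scaled row
 * gets the same value times inv_scale times the per-user scale. */
static void scale_entry(int32_t r_sum, float bf_scale, float inv_scale,
                        float scale, BF8 *no_scale, BF8 *scaled)
{
    float fsum = (float)r_sum * bf_scale;
    float fsum_scale = fsum * inv_scale * scale;

    *no_scale = to_bf8(fsum);
    *scaled = to_bf8(fsum_scale);
}
```

With these assumed constants, a sum of 40 with bf_scale 0.25, inv_scale 0.5 and scale 4 gives 10 in the unscaled row and 20 in the scaled row, and a sum of 4000 saturates both outputs at 127.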



MC Standard Algorithms -- PPC Macro language Version

File Name:      dotpr3_8bit.mac

Description:    Source code for routine which computes three
                dot products, combining the three sums prior
                to exit.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason
0.0         000510    fpl         Created
0.1         000521    fpl         Added num_cached_rows
0.2         000521    fpl         Changed to fixed point
0.3         000605    fpl         Changed to .k file
0.4         000926    jg          Back to .mac and no dsts



#include "salppc.inc"

#define LVX_BT( vT, rA, rB )            LVX( vT, rA, rB )
#define FUNC_ENTRY
#define VMSUM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT                6
#define HALF_BLOCK_BIT                  0x20
#define QUARTER_BLOCK_BIT               0x10
#define LOOP_BLOCK_SIZE                 64

/**
Input parameters
**/

#define bt1mptr r3
#define r1ptr r4
#define r0ptr r5
#define r1mptr r6
#define C r7
#define N r8
#define hat_tc r9
/**
Local loop registers
**/
#define bt0ptr r10
#define bt1ptr r11
#define index1 r12
#define index2 r13

/**
G4 registers
**/
#define rq10 v0
#define rq11 v1
#define rq12 v2
#define rq13 v3
#define zero v3

#define rq00 v4
#define rq01 v5
#define rq02 v6
#define rq03 v7

#define rq1m0 v8
#define rq1m1 v9
#define rq1m2 v10
#define rq1m3 v11

#define bt1m0 v12
#define bt1m1 v13
#define bt1m2 v14
#define bt1m3 v15

#define bt10 v16
#define bt11 v17
#define bt12 v18
#define bt13 v19

#define bt00 v20
#define bt01 v21
#define bt02 v22
#define bt03 v23

#define sum0 v24
#define sum1 v25
#define sum2 v26
#define sum3 v27

/**
Begin code text
Setup loop registers, test for zero N
**/
FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13
USE_THRU_v27( VRSAVE_COND )
/**
Load up local loop registers
**/
ADD(bt0ptr, bt1mptr, hat_tc)
VXOR(sum0, sum0, sum0)
ADD(bt1ptr, bt0ptr, hat_tc)
LI(index1, 16)
VXOR(sum1, sum1, sum1)
LI(index2, 32)
VXOR(sum2, sum2, sum2)
LI(index3, 48)
VXOR(sum3, sum3, sum3)

SRWI_C(icount, N, LOOP_COUNT_SHIFT)    /* 32 sum updates per loop trip */
BEQ(do_half_block)

/**
Loop entry code
**/
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
DECR_C(icount)
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )
ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
BR( mid_loop )
/**
Loop computes three dot products held in 16 parts
**/
LABEL( loop )
/* { */
LVX( rq10, 0, r1ptr )
VMSUM( sum0, rq1m0, bt10, sum0 )
LVX( rq11, r1ptr, index1 )
VMSUM( sum1, rq1m1, bt11, sum1 )
LVX( rq12, r1ptr, index2 )
VMSUM( sum2, rq1m2, bt12, sum2 )
LVX( rq13, r1ptr, index3 )
DECR_C(icount)
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum3, rq1m3, bt13, sum3 )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )
ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
LABEL( mid_loop )
LVX( rq00, 0, r0ptr )
VMSUM( sum0, rq10, bt1m0, sum0 )
LVX( rq01, r0ptr, index1 )
VMSUM( sum1, rq11, bt1m1, sum1 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum2, rq12, bt1m2, sum2 )
LVX( rq03, r0ptr, index3 )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum3, rq13, bt1m3, sum3 )
LVX_BT( bt01, bt0ptr, index1 )
ADDI(r0ptr, r0ptr, LOOP_BLOCK_SIZE)
LVX_BT( bt02, bt0ptr, index2 )
LVX_BT( bt03, bt0ptr, index3 )
ADDI(bt0ptr, bt0ptr, LOOP_BLOCK_SIZE)
LVX( rq1m0, 0, r1mptr )
VMSUM( sum0, rq00, bt00, sum0 )
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum1, rq01, bt01, sum1 )
LVX( rq1m2, r1mptr, index2 )
VMSUM( sum2, rq02, bt02, sum2 )
LVX( rq1m3, r1mptr, index3 )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum3, rq03, bt03, sum3 )
LVX_BT( bt11, bt1ptr, index1 )
ADDI(r1mptr, r1mptr, LOOP_BLOCK_SIZE)
LVX_BT( bt12, bt1ptr, index2 )
LVX_BT( bt13, bt1ptr, index3 )
ADDI(bt1ptr, bt1ptr, LOOP_BLOCK_SIZE)
/* } */
BNE( loop )
/**
Loop exit code
**/
VMSUM( sum0, rq1m0, bt10, sum0 )
VMSUM( sum1, rq1m1, bt11, sum1 )
VMSUM( sum2, rq1m2, bt12, sum2 )
VMSUM( sum3, rq1m3, bt13, sum3 )
/**
Remainders
**/
LABEL( do_half_block )
ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ(do_quarter_block)
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI(r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )
ADDI(bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum0, rq10, bt1m0, sum0 )
VMSUM( sum1, rq11, bt1m1, sum1 )

LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI(r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )
ADDI(bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum0, rq00, bt00, sum0 )
VMSUM( sum1, rq01, bt01, sum1 )

LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI(r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )
ADDI(bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum0, rq1m0, bt10, sum0 )
VMSUM( sum1, rq1m1, bt11, sum1 )

LABEL( do_quarter_block )
ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ(combine)
LVX( rq10, 0, r1ptr )
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum0, rq10, bt1m0, sum0 )

LVX( rq00, 0, r0ptr )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum0, rq00, bt00, sum0 )

LVX( rq1m0, 0, r1mptr )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum0, rq1m0, bt10, sum0 )
/**
Combine sums and return
**/
LABEL( combine )
VXOR( zero, zero, zero )
VADDSWS( sum0, sum0, sum1 )    /* s00 s01 s02 s03 */
VADDSWS( sum2, sum2, sum3 )    /* s22 s21 s22 s23 */
VADDSWS( sum0, sum0, sum2 )    /* s00 s01 s02 s03 */
VSUMSWS( sum0, sum0, zero )    /* xxx xxx xxx s00 */
VSPLTW( sum0, sum0, 3 )        /* s00 s00 s00 s00 */
STVEWX( sum0, 0, C )
/**
Return
**/
LABEL( ret )
FREE_THRU_v27( VRSAVE_COND )
REST_r13
RETURN
FUNC_EPILOG
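
The routine above processes N bytes per stream in full 64-byte loop trips, followed by an optional 32-byte half block and an optional 16-byte quarter block, and accumulates three byte-wise dot products into one result. The following scalar sketch illustrates both the block split and the combined sum. It assumes signed 8-bit operands on both sides and ignores the saturating adds used in the combine step; split_blocks, dot8 and dotpr3_ref are hypothetical names, not part of the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Constants as given in the listing. */
#define LOOP_COUNT_SHIFT  6
#define HALF_BLOCK_BIT    0x20
#define QUARTER_BLOCK_BIT 0x10

typedef struct { unsigned full; unsigned half; unsigned quarter; } block_split;

/* How a byte count is divided among loop trips and the two remainders. */
static block_split split_blocks(unsigned n)
{
    block_split s;

    s.full = n >> LOOP_COUNT_SHIFT;
    s.half = (n & HALF_BLOCK_BIT) ? 1u : 0u;
    s.quarter = (n & QUARTER_BLOCK_BIT) ? 1u : 0u;
    return s;
}

/* Plain byte-wise dot product with a 32-bit accumulator. */
static int32_t dot8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t sum = 0;
    size_t k;

    for (k = 0; k < n; k++)
        sum += (int32_t)a[k] * (int32_t)b[k];
    return sum;
}

/* Scalar reference: R1.Bt1m + R0.Bt0 + R1m.Bt1, where the three Bt rows
 * start hat_tc bytes apart, as the pointer setup in the listing suggests. */
static int32_t dotpr3_ref(const int8_t *bt1m, const int8_t *r1,
                          const int8_t *r0, const int8_t *r1m,
                          size_t n, size_t hat_tc)
{
    const int8_t *bt0 = bt1m + hat_tc;
    const int8_t *bt1 = bt0 + hat_tc;

    return dot8(r1, bt1m, n) + dot8(r0, bt0, n) + dot8(r1m, bt1, n);
}
```

For example, a count of 112 bytes splits into one full trip, one half block and one quarter block. Any residue below 16 bytes is not covered by the vector routine, so the sketch matches it only when N is a multiple of 16.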
MC Standard Algorithms -- PPC Macro language Version

File Name:      dotpr6_8bit.mac

Description:    Source code for routine which computes six
                dot products, combining the six sums
                into two outputs prior to exit.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason
0.0         000510    fpl         Created
0.1         000521    fpl         Changed to fixed point
0.2         000521    fpl         Added num_cached_rows
0.3         000605    fpl         Changed to .k file
0.4         000926    jg          Back to .mac and no dsts



#include "salppc.inc"

#define LVX_BT( vT, rA, rB )            LVX( vT, rA, rB )
#define FUNC_ENTRY                      dotpr6_8bit
#define VMSUM( vT, vA, vB, vC )         VMSUMMBM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT                6
#define HALF_BLOCK_BIT                  0x20
#define QUARTER_BLOCK_BIT               0x10
#define LOOP_BLOCK_SIZE                 64

/**
Input parameters
**/

#define bt1mptr r3
#define r1ptr r4
#define r0ptr r5
#define r1mptr r6
#define C r7
#define N r8
#define hat_tc r9
/**
Local loop registers
**/
#define bt0ptr r10
#define bt1ptr r11
#define bt2ptr r12
#define index1 r13
#define index2 r14

/**
G4 registers
**/
#define rq10 v0
#define rq11 v1
#define rq12 v2
#define rq13 v3
#define zero v3

#define rq00 v4
#define rq01 v5
#define rq02 v6
#define rq03 v7

#define rq1m0 v8
#define rq1m1 v9
#define rq1m2 v10
#define rq1m3 v11

#define bt1m0 v12
#define bt1m1 v13
#define bt1m2 v14
#define bt1m3 v15

#define bt10 v12
#define bt11 v13
#define bt12 v14
#define bt13 v15

#define bt00 v16
#define bt01 v17
#define bt02 v18
#define bt03 v19

#define bt20 v16
#define bt21 v17
#define bt22 v18
#define bt23 v19

#define sum00 v20
#define sum01 v21
#define sum02 v22
#define sum03 v23

#define sum10 v24
#define sum11 v25
#define sum12 v26
#define sum13 v27
/**
Begin code text
**/
FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13_r14
USE_THRU_v27( VRSAVE_COND )
/**
Load up local loop registers
**/
ADD(bt0ptr, bt1mptr, hat_tc)
VXOR(sum00, sum00, sum00)
ADD(bt1ptr, bt0ptr, hat_tc)
LI(index1, 16)
ADD(bt2ptr, bt1ptr, hat_tc)
VXOR(sum01, sum01, sum01)
LI(index2, 32)
VXOR(sum02, sum02, sum02)
LI(index3, 48)
VXOR(sum03, sum03, sum03)
VXOR(sum10, sum10, sum10)
VXOR(sum11, sum11, sum11)
VXOR(sum12, sum12, sum12)
VXOR(sum13, sum13, sum13)

SRWI_C(icount, N, LOOP_COUNT_SHIFT)
BEQ(do_half_block)

/**
Loop entry code
**/
LVX_BT( bt1m0, 0, bt1mptr )
DECR_C(icount)
LVX_BT( bt1m1, bt1mptr, index1 )
LVX_BT( bt1m2, bt1mptr, index2 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
BR( mid_loop )

/**
Loop computes three dot products held in 16 parts
**/
LABEL( loop )
/* { */
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum10, rq1m0, bt20, sum10 )
LVX_BT( bt1m1, bt1mptr, index1 )
VMSUM( sum11, rq1m1, bt21, sum11 )
LVX_BT( bt1m2, bt1mptr, index2 )
DECR_C(icount)
VMSUM( sum12, rq1m2, bt22, sum12 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
VMSUM( sum13, rq1m3, bt23, sum13 )
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
ADDI(bt2ptr, bt2ptr, LOOP_BLOCK_SIZE)
LVX( rq13, r1ptr, index3 )
ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
LABEL( mid_loop )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum00, rq10, bt1m0, sum00 )
LVX_BT( bt01, bt0ptr, index1 )
VMSUM( sum01, rq11, bt1m1, sum01 )
LVX_BT( bt02, bt0ptr, index2 )
VMSUM( sum02, rq12, bt1m2, sum02 )
LVX_BT( bt03, bt0ptr, index3 )
ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
LVX( rq00, 0, r0ptr )
VMSUM( sum03, rq13, bt1m3, sum03 )
LVX( rq01, r0ptr, index1 )
VMSUM( sum10, rq10, bt00, sum10 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum11, rq11, bt01, sum11 )
ADDI(bt0ptr, bt0ptr, LOOP_BLOCK_SIZE)
VMSUM( sum12, rq12, bt02, sum12 )
LVX( rq03, r0ptr, index3 )
VMSUM( sum13, rq13, bt03, sum13 )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum00, rq00, bt00, sum00 )
LVX_BT( bt11, bt1ptr, index1 )
ADDI(r0ptr, r0ptr, LOOP_BLOCK_SIZE)
LVX_BT( bt12, bt1ptr, index2 )
VMSUM( sum01, rq01, bt01, sum01 )
LVX_BT( bt13, bt1ptr, index3 )
VMSUM( sum02, rq02, bt02, sum02 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum03, rq03, bt03, sum03 )
ADDI(bt1ptr, bt1ptr, LOOP_BLOCK_SIZE)
VMSUM( sum10, rq00, bt10, sum10 )
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum11, rq01, bt11, sum11 )
LVX( rq1m2, r1mptr, index2 )
VMSUM( sum12, rq02, bt12, sum12 )
LVX( rq1m3, r1mptr, index3 )
ADDI(r1mptr, r1mptr, LOOP_BLOCK_SIZE)
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum13, rq03, bt13, sum13 )
LVX_BT( bt21, bt2ptr, index1 )
VMSUM( sum00, rq1m0, bt10, sum00 )
LVX_BT( bt22, bt2ptr, index2 )
VMSUM( sum01, rq1m1, bt11, sum01 )
LVX_BT( bt23, bt2ptr, index3 )
VMSUM( sum02, rq1m2, bt12, sum02 )
VMSUM( sum03, rq1m3, bt13, sum03 )
/* } */
BNE( loop )

/**
Loop exit code
**/
VMSUM( sum10, rq1m0, bt20, sum10 )
VMSUM( sum11, rq1m1, bt21, sum11 )
ADDI(bt2ptr, bt2ptr, LOOP_BLOCK_SIZE)
VMSUM( sum12, rq1m2, bt22, sum12 )
VMSUM( sum13, rq1m3, bt23, sum13 )
/**
Remainders
**/
LABEL( do_half_block )
ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ(do_quarter_block)
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI(bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI(r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq10, bt1m0, sum00 )
VMSUM( sum01, rq11, bt1m1, sum01 )

LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI(bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum10, rq10, bt00, sum10 )
VMSUM( sum11, rq11, bt01, sum11 )

LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
ADDI(r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq00, bt00, sum00 )
VMSUM( sum01, rq01, bt01, sum01 )

LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI(bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum10, rq00, bt10, sum10 )
VMSUM( sum11, rq01, bt11, sum11 )

LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
ADDI(r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq1m0, bt10, sum00 )
VMSUM( sum01, rq1m1, bt11, sum01 )

LVX_BT( bt20, 0, bt2ptr )
LVX_BT( bt21, bt2ptr, index1 )
ADDI(bt2ptr, bt2ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum10, rq1m0, bt20, sum10 )
VMSUM( sum11, rq1m1, bt21, sum11 )

LABEL( do_quarter_block )
ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ(combine)
LVX_BT( bt1m0, 0, bt1mptr )
LVX( rq10, 0, r1ptr )
VMSUM( sum00, rq10, bt1m0, sum00 )
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum10, rq10, bt00, sum10 )
LVX( rq00, 0, r0ptr )
VMSUM( sum00, rq00, bt00, sum00 )
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum10, rq00, bt10, sum10 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum00, rq1m0, bt10, sum00 )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum10, rq1m0, bt20, sum10 )



/**
Combine and return
**/
LABEL( combine )
VXOR( zero, zero, zero )
VADDSWS( sum00, sum00, sum01 )    /* s00 s01 s02 s03 */
VADDSWS( sum10, sum10, sum11 )
VADDSWS( sum02, sum02, sum03 )    /* s22 s21 s22 s23 */
VADDSWS( sum12, sum12, sum13 )
VADDSWS( sum00, sum00, sum02 )    /* s00 s01 s02 s03 */
VADDSWS( sum10, sum10, sum12 )
VSUMSWS( sum00, sum00, zero )     /* xxx xxx xxx s00 */
VSUMSWS( sum10, sum10, zero )
VSPLTW( sum00, sum00, 3 )         /* s00 s00 s00 s00 */
STVEWX( sum00, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum10, sum10, 3 )
STVEWX( sum10, 0, C )
/**
Return
**/
LABEL( ret )
FREE_THRU_v27( VRSAVE_COND )
REST_r13_r14
RETURN
FUNC_EPILOG
MC Standard Algorithms -- PPC Macro language Version

File Name:      dotpr9_8bit.mac

Description:    Source code for routine which computes nine
                dot products, combining the nine sums
                into three outputs prior to exit.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason
            000510    fpl         Created
            000512    fpl         Added num_cached_rows
            000521    fpl         Changed to fixed point
            000605    fpl         Changed to .k file
            000926    jg          Back to .mac and no dsts

#include "salppc.inc"

#define LVX_BT( vT, rA, rB )            LVX( vT, rA, rB )
#define FUNC_ENTRY
#define VMSUM( vT, vA, vB, vC )
#define LOOP_COUNT_SHIFT                6
#define HALF_BLOCK_BIT                  0x20
#define QUARTER_BLOCK_BIT               0x10
#define LOOP_BLOCK_SIZE                 64

/**
Input parameters
**/

#define bt1mptr r3
#define r1ptr r4
#define r0ptr r5
#define r1mptr r6
#define C r7
#define N r8
#define hat_tc r9
/**
Local loop registers
**/
#define bt0ptr r10
#define bt1ptr r11
#define bt2ptr r12
#define bt3ptr r13
#define index1 r14
#define index2 r15

/**
G4 registers
**/
#define rq10 v0
#define rq11 v1
#define rq12 v2
#define rq13 v3
#define zero v3

#define bt30 v0
#define bt31 v1
#define bt32 v2
#define bt33 v3

#define rq00 v4
#define rq01 v5
#define rq02 v6
#define rq03 v7

#define rq1m0 v8
#define rq1m1 v9
#define rq1m2 v10
#define rq1m3 v11

#define bt1m0 v12
#define bt1m1 v13
#define bt1m2 v14
#define bt1m3 v15

#define bt10 v12
#define bt11 v13
#define bt12 v14
#define bt13 v15

#define bt00 v16
#define bt01 v17
#define bt02 v18
#define bt03 v19

#define bt20 v16
#define bt21 v17
#define bt22 v18
#define bt23 v19

#define sum00 v20
#define sum01 v21
#define sum02 v22
#define sum03 v23

#define sum10 v24
#define sum11 v25
#define sum12 v26
#define sum13 v27

#define sum20 v28
#define sum21 v29
#define sum22 v30
#define sum23 v31
/**
Begin code text
**/
FUNC_PROLOG
ENTRY_7( FUNC_ENTRY, bt1mptr, r1ptr, r0ptr, r1mptr, C, N, hat_tc )
SAVE_r13_r15
USE_THRU_v31( VRSAVE_COND )
/**
Load up local loop registers
**/
ADD(bt0ptr, bt1mptr, hat_tc)
VXOR(sum00, sum00, sum00)
ADD(bt1ptr, bt0ptr, hat_tc)
LI(index1, 16)
ADD(bt2ptr, bt1ptr, hat_tc)
VXOR(sum01, sum01, sum01)
ADD(bt3ptr, bt2ptr, hat_tc)
LI(index2, 32)
VXOR(sum02, sum02, sum02)
LI(index3, 48)
VXOR(sum03, sum03, sum03)
VXOR(sum10, sum10, sum10)
VXOR(sum11, sum11, sum11)
VXOR(sum12, sum12, sum12)
VXOR(sum13, sum13, sum13)
VXOR(sum20, sum20, sum20)
VXOR(sum21, sum21, sum21)
VXOR(sum22, sum22, sum22)
VXOR(sum23, sum23, sum23)
SRWI_C(icount, N, LOOP_COUNT_SHIFT)
BEQ(do_half_block)





/**
Loop entry code
**/
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )

LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
LVX( rq11, r1ptr, index1 )
LVX( rq12, r1ptr, index2 )
LVX( rq13, r1ptr, index3 )
LVX_BT( bt00, 0, bt0ptr )
BR( mid_loop )
/**
Nine dot products producing 3 sums:
sum0 = (R1 * Bt1m) + (R0 * Bt0) + (R1m * Bt1)
sum1 = (R1 * Bt0) + (R0 * Bt1) + (R1m * Bt2)
sum2 = (R1 * Bt1) + (R0 * Bt2) + (R1m * Bt3)
**/

LABEL( loop )
/* { */
LVX_BT( bt1m0, 0, bt1mptr )
VMSUM( sum20, rq1m0, bt30, sum20 )    /* R1m * Bt3 */
LVX_BT( bt1m1, bt1mptr, index1 )
VMSUM( sum21, rq1m1, bt31, sum21 )
LVX_BT( bt1m2, bt1mptr, index2 )
VMSUM( sum22, rq1m2, bt32, sum22 )
LVX_BT( bt1m3, bt1mptr, index3 )
LVX( rq10, 0, r1ptr )
VMSUM( sum23, rq1m3, bt33, sum23 )
ADDI(bt1mptr, bt1mptr, LOOP_BLOCK_SIZE)
LVX( rq11, r1ptr, index1 )
VMSUM( sum20, rq00, bt20, sum20 )     /* R0 * Bt2 */
LVX( rq12, r1ptr, index2 )
VMSUM( sum21, rq01, bt21, sum21 )
DECR_C(icount)
VMSUM( sum22, rq02, bt22, sum22 )
LVX( rq13, r1ptr, index3 )
VMSUM( sum23, rq03, bt23, sum23 )
LVX_BT( bt00, 0, bt0ptr )
/**
Loop entry
**/
LABEL( mid_loop )
VMSUM( sum00, rq10, bt1m0, sum00 )    /* R1 * Bt1m */
LVX_BT( bt01, bt0ptr, index1 )
ADDI(r1ptr, r1ptr, LOOP_BLOCK_SIZE)
LVX_BT( bt02, bt0ptr, index2 )
VMSUM( sum01, rq11, bt1m1, sum01 )
LVX_BT( bt03, bt0ptr, index3 )
VMSUM( sum02, rq12, bt1m2, sum02 )
LVX( rq00, 0, r0ptr )
VMSUM( sum03, rq13, bt1m3, sum03 )
ADDI(bt0ptr, bt0ptr, LOOP_BLOCK_SIZE)
VMSUM( sum10, rq10, bt00, sum10 )     /* R1 * Bt0 */
LVX( rq01, r0ptr, index1 )
VMSUM( sum11, rq11, bt01, sum11 )
LVX( rq02, r0ptr, index2 )
VMSUM( sum12, rq12, bt02, sum12 )
LVX( rq03, r0ptr, index3 )
ADDI(r0ptr, r0ptr, LOOP_BLOCK_SIZE)
VMSUM( sum13, rq13, bt03, sum13 )
LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
VMSUM( sum00, rq00, bt00, sum00 )     /* R0 * Bt0 */
LVX_BT( bt12, bt1ptr, index2 )
VMSUM( sum01, rq01, bt01, sum01 )
LVX_BT( bt13, bt1ptr, index3 )
VMSUM( sum02, rq02, bt02, sum02 )
VMSUM( sum03, rq03, bt03, sum03 )
LVX( rq1m0, 0, r1mptr )
VMSUM( sum20, rq10, bt10, sum20 )     /* R1 * Bt1 */
LVX( rq1m1, r1mptr, index1 )
VMSUM( sum21, rq11, bt11, sum21 )
LVX( rq1m2, r1mptr, index2 )
ADDI(bt1ptr, bt1ptr, LOOP_BLOCK_SIZE)
LVX( rq1m3, r1mptr, index3 )
VMSUM( sum22, rq12, bt12, sum22 )
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum23, rq13, bt13, sum23 )
LVX_BT( bt21, bt2ptr, index1 )
VMSUM( sum10, rq00, bt10, sum10 )     /* R0 * Bt1 */
ADDI(r1mptr, r1mptr, LOOP_BLOCK_SIZE)
VMSUM( sum11, rq01, bt11, sum11 )
LVX_BT( bt22, bt2ptr, index2 )
VMSUM( sum12, rq02, bt12, sum12 )
LVX_BT( bt23, bt2ptr, index3 )
VMSUM( sum13, rq03, bt13, sum13 )
LVX_BT( bt30, 0, bt3ptr )
LVX_BT( bt31, bt3ptr, index1 )
VMSUM( sum00, rq1m0, bt10, sum00 )    /* R1m * Bt1 */
LVX_BT( bt32, bt3ptr, index2 )
VMSUM( sum01, rq1m1, bt11, sum01 )
LVX_BT( bt33, bt3ptr, index3 )
VMSUM( sum02, rq1m2, bt12, sum02 )
VMSUM( sum03, rq1m3, bt13, sum03 )
ADDI(bt2ptr, bt2ptr, LOOP_BLOCK_SIZE)
VMSUM( sum10, rq1m0, bt20, sum10 )    /* R1m * Bt2 */
VMSUM( sum11, rq1m1, bt21, sum11 )
ADDI(bt3ptr, bt3ptr, LOOP_BLOCK_SIZE)
VMSUM( sum12, rq1m2, bt22, sum12 )
VMSUM( sum13, rq1m3, bt23, sum13 )
/* } */
BNE( loop )

/**
Loop exit code
**/
VMSUM( sum20, rq1m0, bt30, sum20 )    /* R1m * Bt3 */
VMSUM( sum21, rq1m1, bt31, sum21 )
VMSUM( sum22, rq1m2, bt32, sum22 )
VMSUM( sum23, rq1m3, bt33, sum23 )
VMSUM( sum20, rq00, bt20, sum20 )     /* R0 * Bt2 */
VMSUM( sum21, rq01, bt21, sum21 )
VMSUM( sum22, rq02, bt22, sum22 )
VMSUM( sum23, rq03, bt23, sum23 )
/**
Remainders
**/

LABEL( do_half_block )
ANDI_C( icount, N, HALF_BLOCK_BIT )
BEQ(do_quarter_block)
LVX_BT( bt1m0, 0, bt1mptr )
LVX_BT( bt1m1, bt1mptr, index1 )
ADDI(bt1mptr, bt1mptr, (LOOP_BLOCK_SIZE >> 1) )
LVX( rq10, 0, r1ptr )
LVX( rq11, r1ptr, index1 )
ADDI(r1ptr, r1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq10, bt1m0, sum00 )    /* R1 * Bt1m */
VMSUM( sum01, rq11, bt1m1, sum01 )

LVX_BT( bt00, 0, bt0ptr )
LVX_BT( bt01, bt0ptr, index1 )
ADDI(bt0ptr, bt0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum10, rq10, bt00, sum10 )     /* R1 * Bt0 */
VMSUM( sum11, rq11, bt01, sum11 )

LVX( rq00, 0, r0ptr )
LVX( rq01, r0ptr, index1 )
ADDI(r0ptr, r0ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq00, bt00, sum00 )     /* R0 * Bt0 */
VMSUM( sum01, rq01, bt01, sum01 )

LVX_BT( bt10, 0, bt1ptr )
LVX_BT( bt11, bt1ptr, index1 )
ADDI(bt1ptr, bt1ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum20, rq10, bt10, sum20 )     /* R1 * Bt1 */
VMSUM( sum21, rq11, bt11, sum21 )
VMSUM( sum10, rq00, bt10, sum10 )     /* R0 * Bt1 */
VMSUM( sum11, rq01, bt11, sum11 )

LVX( rq1m0, 0, r1mptr )
LVX( rq1m1, r1mptr, index1 )
ADDI(r1mptr, r1mptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum00, rq1m0, bt10, sum00 )    /* R1m * Bt1 */
VMSUM( sum01, rq1m1, bt11, sum01 )

LVX_BT( bt20, 0, bt2ptr )
LVX_BT( bt21, bt2ptr, index1 )
ADDI(bt2ptr, bt2ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum20, rq00, bt20, sum20 )     /* R0 * Bt2 */
VMSUM( sum21, rq01, bt21, sum21 )
VMSUM( sum10, rq1m0, bt20, sum10 )    /* R1m * Bt2 */
VMSUM( sum11, rq1m1, bt21, sum11 )

LVX_BT( bt30, 0, bt3ptr )
LVX_BT( bt31, bt3ptr, index1 )
ADDI(bt3ptr, bt3ptr, (LOOP_BLOCK_SIZE >> 1) )
VMSUM( sum20, rq1m0, bt30, sum20 )    /* R1m * Bt3 */
VMSUM( sum21, rq1m1, bt31, sum21 )

/**
four more sums
**/
LABEL( do_quarter_block )
ANDI_C( icount, N, QUARTER_BLOCK_BIT )
BEQ(combine)
LVX_BT( bt1m0, 0, bt1mptr )
LVX( rq10, 0, r1ptr )
VMSUM( sum00, rq10, bt1m0, sum00 )    /* R1 * Bt1m */
ADDI(bt1mptr, bt1mptr, 16)
LVX_BT( bt00, 0, bt0ptr )
VMSUM( sum10, rq10, bt00, sum10 )     /* R1 * Bt0 */
LVX( rq00, 0, r0ptr )
VMSUM( sum00, rq00, bt00, sum00 )     /* R0 * Bt0 */
LVX_BT( bt10, 0, bt1ptr )
VMSUM( sum20, rq10, bt10, sum20 )     /* R1 * Bt1 */
VMSUM( sum10, rq00, bt10, sum10 )     /* R0 * Bt1 */

LVX( rq1m0, 0, r1mptr )
VMSUM( sum00, rq1m0, bt10, sum00 )    /* R1m * Bt1 */
LVX_BT( bt20, 0, bt2ptr )
VMSUM( sum20, rq00, bt20, sum20 )     /* R0 * Bt2 */
VMSUM( sum10, rq1m0, bt20, sum10 )    /* R1m * Bt2 */

LVX_BT( bt30, 0, bt3ptr )
VMSUM( sum20, rq1m0, bt30, sum20 )    /* R1m * Bt3 */



/**
Combine sums and return
**/
LABEL( combine )
VXOR( zero, zero, zero )
VADDSWS( sum00, sum00, sum01 )
VADDSWS( sum10, sum10, sum11 )
VADDSWS( sum20, sum20, sum21 )
VADDSWS( sum02, sum02, sum03 )
VADDSWS( sum12, sum12, sum13 )
VADDSWS( sum22, sum22, sum23 )
VADDSWS( sum00, sum00, sum02 )
VADDSWS( sum10, sum10, sum12 )
VADDSWS( sum20, sum20, sum22 )
VSUMSWS( sum00, sum00, zero )     /* xxx xxx xxx s00 */
VSUMSWS( sum10, sum10, zero )
VSUMSWS( sum20, sum20, zero )
VSPLTW( sum00, sum00, 3 )         /* s00 s00 s00 s00 */
STVEWX( sum00, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum10, sum10, 3 )
STVEWX( sum10, 0, C )
ADDI( C, C, 4 )
VSPLTW( sum20, sum20, 3 )
STVEWX( sum20, 0, C )
/**
Return
**/
LABEL( ret )
FREE_THRU_v31( VRSAVE_COND )
REST_r13_r15
RETURN
FUNC_EPILOG
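
The comment in the listing above describes three outputs, each the sum of three dot products taken against consecutive Bt rows. The following scalar sketch states the same computation. It assumes signed 8-bit operands, rows of Bt that start hat_tc bytes apart (as the ADD instructions that derive bt0ptr through bt3ptr suggest), and no saturation; dot8 and dotpr9_ref are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain byte-wise dot product with a 32-bit accumulator. */
static int32_t dot8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t sum = 0;
    size_t k;

    for (k = 0; k < n; k++)
        sum += (int32_t)a[k] * (int32_t)b[k];
    return sum;
}

/* Scalar reference for the three outputs:
 *   out[0] = R1.Bt1m + R0.Bt0 + R1m.Bt1
 *   out[1] = R1.Bt0  + R0.Bt1 + R1m.Bt2
 *   out[2] = R1.Bt1  + R0.Bt2 + R1m.Bt3
 * bt1m points at row Bt1m; later rows follow at hat_tc-byte intervals. */
static void dotpr9_ref(const int8_t *bt1m, const int8_t *r1,
                       const int8_t *r0, const int8_t *r1m,
                       int32_t out[3], size_t n, size_t hat_tc)
{
    size_t s;

    for (s = 0; s < 3; s++) {
        const int8_t *bt = bt1m + s * hat_tc;

        out[s] = dot8(r1, bt, n)
               + dot8(r0, bt + hat_tc, n)
               + dot8(r1m, bt + 2 * hat_tc, n);
    }
}
```

Each output reuses two of the Bt rows needed by its neighbour, which is consistent with the vector code loading each Bt row once per loop trip and feeding it to several accumulators.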
#ifndef MCOS_55
#define MCOS_55 0
#endif
Description:    Vector Single Precision Complex Dot Product
Entry/params:   CDOTPR (A, I, B, J, C, N)
Formula:        C[0] = sum (A[mI]*B[mJ] - A[mI+1]*B[mJ+1])
                C[1] = sum (A[mI]*B[mJ+1] + A[mI+1]*B[mJ])
                for m=0 to N-1

Added conjugate entry
fpl     Increased minimum VMX count
jfk     removed branches to entrypoints
jfk     Fixed floating point save bug
1.2     000607
1.3     000610  fpl     Added new API macro
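
The CDOTPR formula in the header above is an ordinary complex dot product over interleaved real and imaginary values, with I and J as the element strides of A and B. A minimal scalar sketch of the non-conjugate case follows; cdotpr_ref is a hypothetical name, and the strides are assumed to be counted in floats, as the indexing in the formula suggests.

```c
/* Scalar reference for the non-conjugate formula:
 *   C[0] = sum( A[mI]*B[mJ]   - A[mI+1]*B[mJ+1] )
 *   C[1] = sum( A[mI]*B[mJ+1] + A[mI+1]*B[mJ]   )   for m = 0 .. N-1 */
static void cdotpr_ref(const float *a, int i_stride,
                       const float *b, int j_stride,
                       float *c, int n)
{
    float re = 0.0f, im = 0.0f;
    int m;

    for (m = 0; m < n; m++) {
        float ar = a[m * i_stride], ai = a[m * i_stride + 1];
        float br = b[m * j_stride], bi = b[m * j_stride + 1];

        re += ar * br - ai * bi;
        im += ar * bi + ai * br;
    }
    c[0] = re;
    c[1] = im;
}
```

With densely packed complex data both strides are 2. For A = (1+2i, 3+4i) and B = (5+6i, 7+8i) the result is -18+68i.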



#include "salppc.inc"
#undef BR_IF_VMX_Z2

#define BR_IF_VMX_Z2( root_name, uroot_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr2, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_unaligned_vmx; \
    BR_VMX_Z2( root_name, eflag, s1 ) \
z_unaligned_vmx: \
    BR_VMX_Z2( uroot_name, eflag, s1 ) \
z_skip_vmx:

#define ACOND 5
#define ABIT 2
#define BCOND 6
#define BBIT 1



/** 

API registers 
**/ 

#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define N r8
#define EFLAG r9

/**
z input args
**/
#define Ar A
#define Ai r10
#define Br B
#define Bi r11
#define Cr C
#define Ci r12

/**
Local registers
**/
#define count r13
#define rtmp r13
#define nextline r14

/**
Fpu registers
**/
#define rsumr0 f0
#define rsumi0 f1
#define isumr0 f2
#define isumi0 f3

#define ar0 f4
#define ai0 f5
#define ar1 f6
#define ai1 f7
#define ar2 f8
#define ai2 f9
#define ar3 f10
#define ai3 f11

#define br0 f12
#define bi0 f13
#define br1 f14
#define bi1 f15
#define br2 f16
#define bi2 f17
#define br3 f18
#define bi3 f19

#if defined ( BUILD_MAX ) 
#if MCOS 55 

DECLARE_VMX_Z2 ( _zdotpr_vmx_cc ) 
#else

DECLARE_VMX_Z2 ( _zdotpr_vmx ) 
#endif 

DECLARE_VMX_Z2 ( _zdotpr4_vmx ) 
#endif 

/**
    Code text: Conjugate
**/
FUNC_PROLOG
#ifndef COMPILE_C
U_ENTRY( fixed_cidotpr_ )               /* Fortran SAL */
    FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed_cidotpr )                /* C SAL */
    LI( EFLAG, SAL_NNN )                /* NNN EFLAG (default) */
    BR( cidotprx_common )               /* common path */
U_ENTRY( fixed_cidotprx )               /* Fortran ESAL */
    FORTRAN_DREF_4( I, J, N, EFLAG )
U_ENTRY( fixed_cidotprx )               /* C ESAL */
LABEL( cidotprx_common )                /* common path */
    ADDI( Ai, Ar, 4 )
    MR( Bi, Br )
    ADDI( Br, Br, 4 )
    MR( Ci, Cr )
    ADDI( Cr, Cr, 4 )
    BR( common )                        /* common path */

/**
    Normal
**/

FUNC_PROLOG
#ifndef COMPILE_C
U_ENTRY( fixed_cdotpr_ )                /* Fortran SAL */
    FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed_cdotpr )                 /* C SAL */
    LI( EFLAG, SAL_NNN )                /* NNN EFLAG (default) */
    BR( cdotprx_common )                /* common path */
U_ENTRY( fixed_cdotprx )                /* Fortran ESAL */
    FORTRAN_DREF_4( I, J, N, EFLAG )
U_ENTRY( fixed_cdotprx )                /* C ESAL */
LABEL( cdotprx_common )                 /* common path */
    ADDI( Ai, Ar, 4 )
    ADDI( Bi, Br, 4 )
    ADDI( Ci, Cr, 4 )
    BR( common )                        /* common path */

/*
    Split complex entries: Conjugate
*/



U_ENTRY( fixed_zidotpr_ )               /* Fortran SAL */
    FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed_zidotpr )                /* C SAL */
    LI( EFLAG, SAL_NNN )                /* NNN EFLAG (default) */
    BR( zidotprx_common )
U_ENTRY( fixed_zidotprx )               /* Fortran ESAL */
    FORTRAN_DREF_4( I, J, N, EFLAG )
#endif
ENTRY_7( fixed_zidotprx, A, I, B, J, C, N, EFLAG )
LABEL( zidotprx_common )

/**
    Assign split complex pointers, do the conjugate trick
**/

    LWZ( Ai, A, 4 )
    LWZ( Ar, A, 0 )
    LWZ( Bi, B, 0 )
    LWZ( Br, B, 4 )
    LWZ( Ci, C, 0 )
    LWZ( Cr, C, 4 )
    BR( z_common )

/**
    Normal
**/

U_ENTRY( fixed_zdotpr_ )                /* Fortran SAL */
    FORTRAN_DREF_3( I, J, N )
U_ENTRY( fixed_zdotpr )                 /* C SAL */
    LI( EFLAG, SAL_NNN )                /* NNN EFLAG (default) */
    BR( zdotprx_common )
U_ENTRY( fixed_zdotprx )                /* Fortran ESAL */
    FORTRAN_DREF_4( I, J, N, EFLAG )
#endif
/**
    C ESAL
**/
ENTRY_7( fixed_zdotprx, A, I, B, J, C, N, EFLAG )
DECLARE_r10_r14
DECLARE_f0_f19

LABEL( zdotprx_common )

/**
    Assign split complex pointers
**/

    LWZ( Ai, A, 4 )     /* must load imag first since Ar reg = A reg */
    LWZ( Ar, A, 0 )
    LWZ( Bi, B, 4 )
    LWZ( Br, B, 0 )
    LWZ( Ci, C, 4 )
    LWZ( Cr, C, 0 )



/**
    VMX API filter
    Test if okay to enter VMX code and branch to VMX code
    VMX loop - process all N points
**/

LABEL( z_common )

#if defined( BUILD_MAX )

                                    UNIT_STRIDE, \
    zdotpr4_vmx, MIN_VMX_N, UNIT_STRIDE, \
    Ar, Ai, I, Br, Bi, J, N, EFLAG )
#endif

#endif /* BUILD_MAX */
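
The decision made by the VMX API filter can be restated in C. The following is a minimal sketch, assuming the parameter order reconstructed for BR_IF_VMX_Z2 above; the function name, the enum and the use of plain integer addresses are illustrative and do not appear in the listing. The listing's BR_VMX_Z2 macro may apply further tests (for example on EFLAG) that are not modeled here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result names; the listing branches to labels instead. */
typedef enum {
    PATH_SCALAR,          /* z_skip_vmx: fall through to the scalar code  */
    PATH_VMX_ALIGNED,     /* BR_VMX_Z2( root_name, ... )                  */
    PATH_VMX_UNALIGNED    /* BR_VMX_Z2( uroot_name, ... )                 */
} vmx_path;

/* Same order of tests as the macro: count, strides, then the low four
   address bits of the real/imaginary pointer pairs. */
vmx_path choose_vmx_path(uintptr_t pr1, uintptr_t pi1, long s1,
                         uintptr_t pr2, uintptr_t pi2, long s2,
                         unsigned long n, unsigned long min_n,
                         long unit_stride)
{
    if (n < min_n)
        return PATH_SCALAR;                 /* too few points for VMX      */
    if (s1 != unit_stride || s2 != unit_stride)
        return PATH_SCALAR;                 /* VMX path needs unit strides */
    if (((pr1 ^ pi1) & 0xf) != 0)
        return PATH_SCALAR;                 /* A real/imag not co-aligned  */
    if (((pr2 ^ pi2) & 0xf) != 0)
        return PATH_SCALAR;                 /* B real/imag not co-aligned  */
    if (((pr1 ^ pr2) & 0xf) != 0)
        return PATH_VMX_UNALIGNED;          /* A and B differ in alignment */
    return PATH_VMX_ALIGNED;
}
```

The XOR-then-mask test compares only the offset of two pointers within a 16-byte vector, so two buffers pass it whenever they share the same offset, even if neither starts on a 16-byte boundary.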

/**
    Point of common path where all entries join
    Test for small counts
**/

LABEL( common )
    SAVE_r13_r14
    SAVE_f14_f19
    CMPLWI( N, 0 )
    BEQ( ret )
    CMPLWI( N, 1 )
    BEQ( do1 )
    CMPLWI( N, 2 )
    BEQ( do2 )
    CMPLWI( N, 3 )
    BEQ( do3 )

/**
    check for uncached (and local) vectors
**/
    SET_2_DCBT_COND( ACOND, ABIT, BCOND, BBIT, EFLAG, rtmp )
    LI( nextline, 32 )
/**
    740 code segment, start up loop code
**/




fixed_cdotpr.mac



#if defined( BUILD_750 ) || defined( BUILD_MAX )
    LFS( ar0, Ar, 0 )
    SRWI( count, N, 2 )                 /* count = N >> 2 */
    LFS( br0, Br, 0 )
    SLWI( I, I, 2 )                     /* byte strides */
    LFS( ai0, Ai, 0 )
    SLWI( J, J, 2 )
    LFS( bi0, Bi, 0 )

    LFSUX( ar1, Ar, I )
    LFSUX( br1, Br, J )
    LFSUX( ai1, Ai, I )
    LFSUX( bi1, Bi, J )
    LFSUX( ar2, Ar, I )
    LFSUX( br2, Br, J )
    LFSUX( ai2, Ai, I )
    LFSUX( bi2, Bi, J )
    FMULS( rsumr0, ar0, br0 )
    LFSUX( ar3, Ar, I )
    LFSUX( br3, Br, J )
    FMULS( rsumi0, ai0, bi0 )
    LFSUX( ai3, Ai, I )
    LFSUX( bi3, Bi, J )
    FMULS( isumi0, ar0, bi0 )
    DECR_C( count )
    FMULS( isumr0, ai0, br0 )
    BEQ( flush_loop_740 )
    BR( mloop_740 )

/**
    Top of 740 loop
**/

LABEL( loop_740 )
    LFSUX( ar3, Ar, I )
    FMADDS( rsumr0, ar0, br0, rsumr0 )
    LFSUX( br3, Br, J )
    FMADDS( rsumi0, ai0, bi0, rsumi0 )
    LFSUX( ai3, Ai, I )
    FMADDS( isumi0, ar0, bi0, isumi0 )
    FMADDS( isumr0, ai0, br0, isumr0 )
    LFSUX( bi3, Bi, J )

LABEL( mloop_740 )
    FMADDS( rsumr0, ar1, br1, rsumr0 )
    LFSUX( ar0, Ar, I )
    DCBT_IF( ACOND, Ar, nextline )
    FMADDS( rsumi0, ai1, bi1, rsumi0 )
    LFSUX( br0, Br, J )
    DECR_C( count )
    FMADDS( isumi0, ar1, bi1, isumi0 )
    LFSUX( ai0, Ai, I )
    FMADDS( isumr0, ai1, br1, isumr0 )
    LFSUX( bi0, Bi, J )
    DCBT_IF( BCOND, Br, nextline )
    FMADDS( rsumr0, ar2, br2, rsumr0 )
    LFSUX( ar1, Ar, I )
    LFSUX( br1, Br, J )
    FMADDS( rsumi0, ai2, bi2, rsumi0 )
    LFSUX( ai1, Ai, I )
    FMADDS( isumi0, ar2, bi2, isumi0 )
    LFSUX( bi1, Bi, J )
    FMADDS( isumr0, ai2, br2, isumr0 )
    FMADDS( rsumr0, ar3, br3, rsumr0 )
    LFSUX( ar2, Ar, I )
    FMADDS( rsumi0, ai3, bi3, rsumi0 )
    LFSUX( br2, Br, J )
    FMADDS( isumi0, ar3, bi3, isumi0 )
    LFSUX( ai2, Ai, I )
    LFSUX( bi2, Bi, J )
    FMADDS( isumr0, ai3, br3, isumr0 )
    BNE( loop_740 )

/**
    Finish last pass
**/

    FMADDS( rsumr0, ar0, br0, rsumr0 )
    LFSUX( ar3, Ar, I )
    LFSUX( br3, Br, J )
    FMADDS( rsumi0, ai0, bi0, rsumi0 )
    LFSUX( ai3, Ai, I )
    LFSUX( bi3, Bi, J )
    FMADDS( isumi0, ar0, bi0, isumi0 )
    FMADDS( isumr0, ai0, br0, isumr0 )




LABEL( flush_loop_740 )
    FMADDS( rsumr0, ar1, br1, rsumr0 )
    FMADDS( rsumi0, ai1, bi1, rsumi0 )
    FMADDS( isumi0, ar1, bi1, isumi0 )
    FMADDS( isumr0, ai1, br1, isumr0 )
    FMADDS( rsumr0, ar2, br2, rsumr0 )
    FMADDS( rsumi0, ai2, bi2, rsumi0 )
    FMADDS( isumi0, ar2, bi2, isumi0 )
    FMADDS( isumr0, ai2, br2, isumr0 )
    FMADDS( rsumr0, ar3, br3, rsumr0 )
    FMADDS( rsumi0, ai3, bi3, rsumi0 )
    FMADDS( isumi0, ar3, bi3, isumi0 )
    FMADDS( isumr0, ai3, br3, isumr0 )
    BR( remain )

#endif /** 750 specific code section **/

/**
    set up for loop entry, here if N >= 2
**/

#if defined( BUILD_603 )
LABEL( start_603 )
    LFS( ar0, Ar, 0 )
    SLWI( I, I, 2 )                     /* byte strides */
    LFS( ai0, Ai, 0 )
    SRWI( count, N, 2 )
    LFSUX( ar1, Ar, I )
    SLWI( J, J, 2 )
    LFSUX( ai1, Ai, I )
    LFSUX( ar2, Ar, I )
    LFSUX( ai2, Ai, I )
    LFSUX( ar3, Ar, I )
    LFSUX( ai3, Ai, I )
    DCBT_IF( ACOND, Ar, nextline )

    LFS( br0, Br, 0 )
    DECR_C( count )
    LFS( bi0, Bi, 0 )
    LFSUX( br1, Br, J )
    LFSUX( bi1, Bi, J )
    LFSUX( br2, Br, J )
    LFSUX( bi2, Bi, J )
    LFSUX( br3, Br, J )
    LFSUX( bi3, Bi, J )
    DCBT_IF( BCOND, Br, nextline )

    FMULS( rsumr0, ar0, br0 )
    FMULS( rsumi0, ai0, bi0 )
    FMULS( isumi0, ar0, bi0 )
    FMULS( isumr0, ai0, br0 )
    FMADDS( rsumr0, ar1, br1, rsumr0 )
    FMADDS( rsumi0, ai1, bi1, rsumi0 )
    FMADDS( isumi0, ar1, bi1, isumi0 )
    FMADDS( isumr0, ai1, br1, isumr0 )
    FMADDS( rsumr0, ar2, br2, rsumr0 )
    FMADDS( rsumi0, ai2, bi2, rsumi0 )
    FMADDS( isumi0, ar2, bi2, isumi0 )
    FMADDS( isumr0, ai2, br2, isumr0 )
    FMADDS( rsumr0, ar3, br3, rsumr0 )
    FMADDS( rsumi0, ai3, bi3, rsumi0 )
    FMADDS( isumi0, ar3, bi3, isumi0 )
    FMADDS( isumr0, ai3, br3, isumr0 )

/**
    main loop maintains four partial sums
    representing two complex sum updates per pass
**/

LABEL( loop )
    LFSUX( ar0, Ar, I )
    LFSUX( ai0, Ai, I )
    LFSUX( ar1, Ar, I )
    LFSUX( ai1, Ai, I )
    LFSUX( ar2, Ar, I )
    LFSUX( ai2, Ai, I )
    LFSUX( ar3, Ar, I )
    LFSUX( ai3, Ai, I )
    DCBT_IF( ACOND, Ar, nextline )
    DECR_C( count )
    LFSUX( br0, Br, J )
    LFSUX( bi0, Bi, J )
    LFSUX( br1, Br, J )
    LFSUX( bi1, Bi, J )
    LFSUX( br2, Br, J )
    LFSUX( bi2, Bi, J )
    LFSUX( br3, Br, J )
    LFSUX( bi3, Bi, J )
    DCBT_IF( BCOND, Br, nextline )
    FMADDS( rsumr0, ar0, br0, rsumr0 )
    FMADDS( rsumi0, ai0, bi0, rsumi0 )
    FMADDS( isumi0, ar0, bi0, isumi0 )
    FMADDS( isumr0, ai0, br0, isumr0 )
    FMADDS( rsumr0, ar1, br1, rsumr0 )
    FMADDS( rsumi0, ai1, bi1, rsumi0 )
    FMADDS( isumi0, ar1, bi1, isumi0 )
    FMADDS( isumr0, ai1, br1, isumr0 )
    FMADDS( rsumr0, ar2, br2, rsumr0 )
    FMADDS( rsumi0, ai2, bi2, rsumi0 )
    FMADDS( isumi0, ar2, bi2, isumi0 )
    FMADDS( isumr0, ai2, br2, isumr0 )
    FMADDS( rsumr0, ar3, br3, rsumr0 )
    FMADDS( rsumi0, ai3, bi3, rsumi0 )
    FMADDS( isumi0, ar3, bi3, isumi0 )
    FMADDS( isumr0, ai3, br3, isumr0 )
    BNE( loop )
#endif /** 603 specific code section **/

/**
    remainder loop
**/

LABEL( remain )
    ANDI_C( count, N, 2 )               /* bit 2 */
    BEQ( sum1 )

    LFSUX( ar0, Ar, I )
    LFSUX( ai0, Ai, I )
    LFSUX( ar1, Ar, I )
    LFSUX( ai1, Ai, I )
    LFSUX( br0, Br, J )
    LFSUX( bi0, Bi, J )
    LFSUX( br1, Br, J )
    LFSUX( bi1, Bi, J )
    FMADDS( rsumr0, ar0, br0, rsumr0 )
    FMADDS( rsumi0, ai0, bi0, rsumi0 )
    FMADDS( isumi0, ar0, bi0, isumi0 )
    FMADDS( isumr0, ai0, br0, isumr0 )
    FMADDS( rsumr0, ar1, br1, rsumr0 )
    FMADDS( rsumi0, ai1, bi1, rsumi0 )
    FMADDS( isumi0, ar1, bi1, isumi0 )
    FMADDS( isumr0, ai1, br1, isumr0 )

LABEL( sum1 )
    ANDI_C( count, N, 1 )               /* bit 0 */
    BEQ( combine )                      /* if no sums left */

    LFSUX( ar0, Ar, I )
    LFSUX( br0, Br, J )
    LFSUX( ai0, Ai, I )
    LFSUX( bi0, Bi, J )
    FMADDS( rsumr0, ar0, br0, rsumr0 )
    FMADDS( rsumi0, ai0, bi0, rsumi0 )
    FMADDS( isumi0, ar0, bi0, isumi0 )
    FMADDS( isumr0, ai0, br0, isumr0 )

/**
    combine partial sums, write out results and return
**/

LABEL( combine )
    FSUBS( rsumr0, rsumr0, rsumi0 )     /** rsumr0 = rsum... **/
    STFS( rsumr0, Cr, 0 )               /** *(S + 0) = r... **/
    FADDS( isumi0, isumi0, isumr0 )
    STFS( isumi0, Ci, 0 )
    BR( ret )
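
The loop structure above (groups of four complex products, then a remainder of two, then a remainder of one, with four partial sums combined at the end) can be sketched in C as follows. This is an illustration only: it assumes unit strides and separate real and imaginary arrays, and the function and macro names are hypothetical rather than taken from the listing.

```c
#include <stddef.h>

/* Add one complex product to the four partial sums, using the same
   pairing of operands as the FMADDS groups in the listing. */
#define ACCUM(k)                        \
    do {                                \
        rsumr += ar[k] * br[k];         \
        rsumi += ai[k] * bi[k];         \
        isumi += ar[k] * bi[k];         \
        isumr += ai[k] * br[k];         \
    } while (0)

void zdot_partial_sums(const float *ar, const float *ai,
                       const float *br, const float *bi,
                       size_t n, float *cr, float *ci)
{
    float rsumr = 0.0f, rsumi = 0.0f, isumi = 0.0f, isumr = 0.0f;
    size_t m = 0;

    /* main loop: count = N >> 2 passes of four points each */
    for (size_t count = n >> 2; count != 0; count--, m += 4) {
        ACCUM(m);
        ACCUM(m + 1);
        ACCUM(m + 2);
        ACCUM(m + 3);
    }
    if (n & 2) {            /* two points left over */
        ACCUM(m);
        ACCUM(m + 1);
        m += 2;
    }
    if (n & 1) {            /* one point left over */
        ACCUM(m);
    }

    /* combine partial sums: real = rsumr - rsumi, imag = isumi + isumr */
    *cr = rsumr - rsumi;
    *ci = isumi + isumr;
}
```

Keeping four independent accumulators means consecutive multiply-adds do not depend on one another, which is presumably why the listing defers the subtraction and addition to the combine step.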

/**
    here for N = 1,2,3
**/

LABEL( do3 )
    LFS( ar0, Ar, 0 )
    SLWI( I, I, 2 )                     /* byte strides */
    LFS( ai0, Ai, 0 )
    LFSUX( ar1, Ar, I )
    SLWI( J, J, 2 )
    LFSUX( ai1, Ai, I )
    LFSUX( ar2, Ar, I )
    LFSUX( ai2, Ai, I )

    LFS( br0, Br, 0 )
    DECR_C( count )
    LFS( bi0, Bi, 0 )
    LFSUX( br1, Br, J )
    LFSUX( bi1, Bi, J )
    LFSUX( br2, Br, J )
    LFSUX( bi2, Bi, J )
    FMULS( rsumr0, ar0, br0 )
    FMULS( rsumi0, ai0, bi0 )
    FMULS( isumi0, ar0, bi0 )
    FMULS( isumr0, ai0, br0 )
    FMADDS( rsumr0, ar1, br1, rsumr0 )
    FMADDS( rsumi0, ai1, bi1, rsumi0 )
    FMADDS( isumi0, ar1, bi1, isumi0 )
    FMADDS( isumr0, ai1, br1, isumr0 )
    FMADDS( rsumr0, ar2, br2, rsumr0 )
    FMADDS( rsumi0, ai2, bi2, rsumi0 )
    FMADDS( isumi0, ar2, bi2, isumi0 )
    FMADDS( isumr0, ai2, br2, isumr0 )
    BR( combine )

LABEL( do2 )
    LFS( ar0, Ar, 0 )
    SLWI( I, I, 2 )                     /* byte strides */
    LFS( ai0, Ai, 0 )
    LFSUX( ar1, Ar, I )
    SLWI( J, J, 2 )
    LFSUX( ai1, Ai, I )

    LFS( br0, Br, 0 )
    LFS( bi0, Bi, 0 )
    LFSUX( br1, Br, J )
    LFSUX( bi1, Bi, J )
    FMULS( rsumr0, ar0, br0 )
    FMULS( rsumi0, ai0, bi0 )
    FMULS( isumi0, ar0, bi0 )
    FMULS( isumr0, ai0, br0 )
    FMADDS( rsumr0, ar1, br1, rsumr0 )
    FMADDS( rsumi0, ai1, bi1, rsumi0 )
    FMADDS( isumi0, ar1, bi1, isumi0 )
    FMADDS( isumr0, ai1, br1, isumr0 )
    BR( combine )

LABEL( do1 )
    LFS( ai0, Ai, 0 )
    LFS( bi0, Bi, 0 )
    LFS( br0, Br, 0 )
    LFS( ar0, Ar, 0 )
    FMULS( rsumi0, ai0, bi0 )
    FMULS( isumr0, ai0, br0 )
    FMSUBS( rsumr0, ar0, br0, rsumi0 )
    STFS( rsumr0, Cr, 0 )
    FMADDS( isumi0, ar0, bi0, isumr0 )
    STFS( isumi0, Ci, 0 )

/**
    return
**/

LABEL( ret )
    REST_f14_f19
    REST_r13_r14
    RETURN
FUNC_EPILOG
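
For reference, the header formula and the "conjugate trick" used by the conjugate entries can be written out in C. The sketch below is not part of the listing; the function names are hypothetical, strides are taken to be in units of floats (the listing converts them to byte strides with a shift by 2), and the early return for N = 0 mirrors the branch to ret before any store.

```c
#include <stddef.h>

/* Shared kernel over six separate pointers, as seen after the entry code:
   *cr = sum(ar*br - ai*bi), *ci = sum(ar*bi + ai*br). */
static void zdot_kernel(const float *ar, const float *ai, size_t i_stride,
                        const float *br, const float *bi, size_t j_stride,
                        float *cr, float *ci, size_t n)
{
    float re = 0.0f, im = 0.0f;
    if (n == 0)
        return;                         /* listing returns without storing */
    for (size_t m = 0; m < n; m++) {
        re += ar[m * i_stride] * br[m * j_stride]
            - ai[m * i_stride] * bi[m * j_stride];
        im += ar[m * i_stride] * bi[m * j_stride]
            + ai[m * i_stride] * br[m * j_stride];
    }
    *cr = re;
    *ci = im;
}

/* Normal entry: interleaved (re, im) data, imaginary parts one float on. */
void cdotpr_ref(const float *a, size_t i_stride,
                const float *b, size_t j_stride, float *c, size_t n)
{
    zdot_kernel(a, a + 1, i_stride, b, b + 1, j_stride, c, c + 1, n);
}

/* Conjugate entry: the real/imag pointers of B and of C are swapped, so
   the same kernel yields C = sum(conj(A) * B). */
void cidotpr_ref(const float *a, size_t i_stride,
                 const float *b, size_t j_stride, float *c, size_t n)
{
    zdot_kernel(a, a + 1, i_stride, b + 1, b, j_stride, c + 1, c, n);
}
```

With the pointers swapped, the kernel's "real" output becomes sum(Ar*Bi - Ai*Br), which lands in C[1], and its "imaginary" output becomes sum(Ar*Br + Ai*Bi), which lands in C[0]. That is exactly the product of conj(A) and B, obtained without a second copy of the loop.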



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MC Standard Algorithms -- PPC Macro language Version 



File Name:      GEN_R_SUMS.MAC

Description:    Multiple small dot product routine for wireless
                group application.

Entry/params:   GEN_R_SUMS (X_bf, Corr_bf, Ptov_map, R_sums, Num_phys_users)

Formula:
    num_sums = 0;
    for ( i = 0; i < Num_phys_users; i++ ) {
        for ( j = 0; j < (int)Ptov_map[i]; j++ ) {
            sum = 0;
            for ( k = 0; k < 16; k++ ) {
                sum += (BF32)X_bf[k].real * (BF32)Corr_bf->real;
                sum += (BF32)X_bf[k].imag * (BF32)Corr_bf->imag;
                ++Corr_bf;
            }
            *R_sums++ = sum;
            ++num_sums;
        }
        X_bf += N_FINGERS_MAX_SQUARED;
    }

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date    Engineer    Reason
0.0         000906  fpl         Created



#include "salppc.inc" 



#if DO 10 

#define PTOV BUMP 1 
CORR BUMP 32 
CORR BUMP_64 
X BUMP 64 
RSUM BUMP 8 
RSUM_BUMP_4 



64 



ttdefine 
#def ine 
#define 
#define 
#def ine 
#else 
#def ine 
#def ine 
#define 
#def ine 
#def ine 
#def ine 
#endif 



#define LOAD_CORR( vT, rA, rB ) 



PTOV BUMP 1 i 
CORR BUMP 32 
CORR BUMP_64 
X BUMP 64 0 
RSUM BUMP 8 
RSUM BUMP 4 



LVX( vT, rA, rB ) 



#define DST_BUMP CORR_BUMP_64

#if DO_PREFETCH

#define PREFETCH( rA, rB, STRM ) \
    DST( rA, rB, STRM ) \
    ADDI( rA, rA, DST_BUMP )

#else

#define PREFETCH( rA, rB, STRM )
#endif



#define OLOOP_BIT 6
/**
    Input parameters
**/

#define X_bf            r3
#define Corr_bf         r4
#define Ptov_map        r5
#define R_sump          r6
#define Num_phys_users  r7
/**
    Local GPRs
**/

#define icount          r8
#define ptov_count      r9
#define indx1           r10
#define indx2           r11
#define indx3           r12
#define sindex1         r13
#define dstp            r14
#define dst_code        r15

/**
    G4 registers
**/

#define corr00  v0
#define corr01  v1
#define corr10  v2
#define corr11  v3

#define C0_0    v4
#define C1_0    v5
#define C0_8    v6
#define C1_8    v7
#define C0_16   v8
#define C1_16   v9
#define C0_24   v10
#define C1_24   v11

#define X0      v12
#define X8      v13
#define X16     v14
#define X24     v15

#define sum0    v16
#define sum1    v17
#define zero    v18

/**
    Begin code text
**/

FUNC_PROLOG
ENTRY_5( gen_R_sums, X_bf, Corr_bf, Ptov_map, R_sump, Num_phys_users )
    CMPWI( Num_phys_users, 0 )
    BGT( start )
    RETURN

LABEL( start )

/**
    DST setup
**/
    MAKE_STREAM_CODE_IIR( dst_code, DST_BUMP, 1, 0 )
    ADDI( dstp, Corr_bf, 80 )           /* start prefetch advanced */
    PREFETCH( dstp, dst_code, 0 )

/**
    Setup for outer loop entry
    Read and expand two corr vectors
    Set outer loop counter condition
**/

    LI( indx1, 16 )
    LI( indx2, 32 )
    LI( indx3, 48 )
    LI( sindex1, 4 )
    CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
    LVX( corr00, 0, Corr_bf )
    VXOR( zero, zero, zero )
    LVX( corr01, Corr_bf, indx1 )
    LVX( corr10, Corr_bf, indx2 )
    LVX( corr11, Corr_bf, indx3 )
    VUPKHSB( C0_0, corr00 )
    ADDI( Corr_bf, Corr_bf, CORR_BUMP_64 )
    VUPKLSB( C0_8, corr00 )
    ADDI( Ptov_map, Ptov_map, -PTOV_BUMP_1 )
    VUPKHSB( C1_0, corr10 )
    ADDI( R_sump, R_sump, -RSUM_BUMP_8 )
    VUPKLSB( C1_8, corr10 )
    VUPKHSB( C0_16, corr01 )
    VUPKLSB( C0_24, corr01 )
    VUPKHSB( C1_16, corr11 )
    VUPKLSB( C1_24, corr11 )

/**
    Outer loop for each physical user
**/

LABEL( oloop )
/* { */
    DECR( Num_phys_users )
    LBZU( ptov_count, Ptov_map, 1 )
    BEQ_CR( OLOOP_BIT, ret )
    LVX( X0, 0, X_bf )
    LVX( X8, X_bf, indx1 )
    SRWI_C( icount, ptov_count, 1 )
    LVX( X16, X_bf, indx2 )
    LVX( X24, X_bf, indx3 )
    ADDI( X_bf, X_bf, X_BUMP_64 )
    CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
    BEQ_MINUS( one_sum )

/**
    Top of sum loop
    Produces two sums each pass
**/

LABEL( iloop )
/* { */
    PREFETCH( dstp, dst_code, 0 )
    VMSUMSHS( sum0, C0_0, X0, zero )
    VMSUMSHS( sum1, C1_0, X0, zero )
    LVX( corr00, 0, Corr_bf )
    LVX( corr01, Corr_bf, indx1 )
    LVX( corr10, Corr_bf, indx2 )
    VMSUMSHS( sum0, C0_8, X8, sum0 )
    DECR_C( icount )
    VMSUMSHS( sum1, C1_8, X8, sum1 )
    LVX( corr11, Corr_bf, indx3 )
    VUPKHSB( C0_0, corr00 )
    VUPKLSB( C0_8, corr00 )
    VMSUMSHS( sum0, C0_16, X16, sum0 )
    VMSUMSHS( sum1, C1_16, X16, sum1 )
    VUPKHSB( C1_0, corr10 )
    ADDI( R_sump, R_sump, RSUM_BUMP_8 )
    VUPKLSB( C1_8, corr10 )
    VMSUMSHS( sum0, C0_24, X24, sum0 )
    VUPKHSB( C0_16, corr01 )
    VMSUMSHS( sum1, C1_24, X24, sum1 )
    VUPKLSB( C0_24, corr01 )
    VUPKHSB( C1_16, corr11 )
    VSUMSWS( sum0, sum0, zero )
    VUPKLSB( C1_24, corr11 )
    VSUMSWS( sum1, sum1, zero )
    ADDI( Corr_bf, Corr_bf, CORR_BUMP_64 )
    VSPLTW( sum0, sum0, 3 )
    STVEWX( sum0, 0, R_sump )
    VSPLTW( sum1, sum1, 3 )
    STVEWX( sum1, R_sump, sindex1 )
/* } */
    BNE( iloop )

/**
    Drop out, check for remainders
**/
    ANDI_C( icount, ptov_count, 0x1 )
    BEQ( oloop )
/**
    One more sum:
    Enters and exits with two corr vectors loaded and expanded to 16 bit
**/

LABEL( one_sum )
    VMSUMSHS( sum0, C0_0, X0, zero )
    VMSUMSHS( sum0, C0_8, X8, sum0 )
    ADDI( R_sump, R_sump, RSUM_BUMP_8 )
    VMSUMSHS( sum0, C0_16, X16, sum0 )
    VMSUMSHS( sum0, C0_24, X24, sum0 )
    VSUMSWS( sum0, sum0, zero )
    VSPLTW( sum0, sum0, 3 )
    STVEWX( sum0, 0, R_sump )
    ADDI( R_sump, R_sump, -RSUM_BUMP_4 )    /* pre-dec pointer for loop reentry */

/**
    Setup for loop re-entry: corr00 consumed in one_sum section

    loop exit ptr v
    corr00 corr10 corr00 corr10

    corr00 corr10 corr00 corr10
    loop re-entry ptr
**/

    VMR( corr00, corr10 )
    LVX( corr10, 0, Corr_bf )
    VMR( corr01, corr11 )
    LVX( corr11, Corr_bf, indx1 )
    ADDI( Corr_bf, Corr_bf, CORR_BUMP_32 )
    VUPKHSB( C0_0, corr00 )
    VUPKLSB( C0_8, corr00 )
    VUPKHSB( C1_0, corr10 )
    VUPKLSB( C1_8, corr10 )
    VUPKHSB( C0_16, corr01 )
    VUPKLSB( C0_24, corr01 )
    VUPKHSB( C1_16, corr11 )
    VUPKLSB( C1_24, corr11 )
/* } */
    BR( oloop )

/**
    Exit routine
**/

LABEL( ret )
    FREE_THRU_v18( VRSAVE_COND )
    REST_r13_r15
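
A plain C restatement of the header formula gives a reference against which the vector routine can be compared. The sketch below is illustrative only: the struct layouts are assumptions (16-bit complex X_bf values and 8-bit complex correlation values, inferred from the byte-to-halfword unpacking in the listing), N_FINGERS_MAX_SQUARED is assumed to be 16, and the function name is hypothetical. It uses ordinary 32-bit arithmetic rather than the saturating vector operations.

```c
#include <stddef.h>
#include <stdint.h>

#define N_FINGERS_MAX_SQUARED 16    /* assumption: 4 fingers squared */

typedef struct { int16_t real, imag; } cplx16;  /* assumed X_bf element    */
typedef struct { int8_t  real, imag; } cplx8;   /* assumed Corr_bf element */

/* For each physical user i, produce Ptov_map[i] sums; each sum consumes the
   next 16 correlation values against that user's 16 X_bf values. */
size_t gen_r_sums_ref(const cplx16 *x_bf, const cplx8 *corr_bf,
                      const uint8_t *ptov_map, int32_t *r_sums,
                      size_t num_phys_users)
{
    size_t num_sums = 0;

    for (size_t i = 0; i < num_phys_users; i++) {
        for (unsigned j = 0; j < ptov_map[i]; j++) {
            int32_t sum = 0;
            for (size_t k = 0; k < 16; k++) {
                sum += (int32_t)x_bf[k].real * (int32_t)corr_bf->real;
                sum += (int32_t)x_bf[k].imag * (int32_t)corr_bf->imag;
                ++corr_bf;
            }
            *r_sums++ = sum;
            ++num_sums;
        }
        x_bf += N_FINGERS_MAX_SQUARED;
    }
    return num_sums;
}
```

With 8-bit by 16-bit products and 32 terms per sum, the total stays well inside the 32-bit range, so the saturation performed by the vector instructions should not change the result for inputs of these assumed widths.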






MC Standard Algorithms -- PPC Macro language Version 



File Name:      GEN_R_SUMS2.MAC
Description:    Multiple small dot product routine for wireless
                group application.
Entry/params:   GEN_R_SUMS2 (X_bf, Corr0_bf, Corr1_bf,
                             Ptov_map, R_sums0, R_sums1, Num_phys_users)

Formula:
    num_sums = 0;
    for ( i = 0; i < Num_phys_users; i++ ) {
        for ( j = 0; j < (int)Ptov_map[i]; j++ ) {
            sum = 0;
            for ( k = 0; k < 16; k++ ) {
                sum0 += (BF32)X_bf[k].real * (BF32)Corr0_bf->real;
                sum0 += (BF32)X_bf[k].imag * (BF32)Corr0_bf->imag;
                sum1 += (BF32)X_bf[k].real * (BF32)Corr1_bf->real;
                sum1 += (BF32)X_bf[k].imag * (BF32)Corr1_bf->imag;
                ++Corr0_bf;
                ++Corr1_bf;
            }
            *R_sums0++ = sum0;
            *R_sums1++ = sum1;
            ++num_sums;
        }
        X_bf += N_FINGERS_MAX_SQUARED;
    }

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date    Engineer    Reason
                    fpl         Created
                    fpl         Fixed zero bug



#include "salppc.inc" 



#if DO IO 

#define PTOV BUMP 1 
CORR BUMP 32 
CORR BUMP_64 
X BUMP 64 
RSUM BUMP 
RSUM_BUMP 



#defir 
#define 
#define 
#defi 
#def ine 
#else 
ttdefine 
#def ine 
#define 
#define 
#def ine 
#def ine 
#endif 



64 



PTOV BUMP 1 i 
CORR BUMP 32 
CORR BUMP_64 
X BUMP 64 0 
RSUM BUMP 8 
RSUM BUMP_4 



#define LOAD_CORR( vT, rA, rB )     LVX( vT, rA, rB )
#define DST_BUMP                    CORR_BUMP_64

#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM ) \
    DST( rA, rB, STRM ) \
    ADDI( rA, rA, DST_BUMP )
#else

#define PREFETCH( rA, rB, STRM )
#endif

#define OLOOP_BIT 6
/**
    Input parameters
**/

#define X_bf            r3
#define Corr0_bf        r4
#define Corr1_bf        r5
#define Ptov_map        r6
#define R_sump0         r7
#define R_sump1         r8
#define Num_phys_users  r9
/**
    Local GPRs
**/

#define icount          r10
#define ptov_count      r11
#define indx1           r12
#define indx2           r13
#define indx3           r14
#define sindex1         r15
#define dstp            r16
#define dst_code        r17
#define dst_stride      indx3
/**
    G4 registers
**/

#define corr00  v0
#define corr01  v1
#define corr10  v2
#define corr11  v3
#define corr20  v4
#define corr21  v5
#define corr30  v6
#define corr31  corr00
#define zero    v7

#define C0_0    v8
#define C1_0    v9
#define C2_0    v10
#define C3_0    v11

#define C0_8    v12
#define C1_8    v13
#define C2_8    v14
#define C3_8    v15

#define C0_16   v16
#define C1_16   v17
#define C2_16   v18
#define C3_16   v19

#define C0_24   v20
#define C1_24   v21
#define C2_24   v22
#define C3_24   v23

#define X0      v24
#define X8      v25
#define X16     v26
#define X24     v27

#define sum0    v28
#define sum1    v29
#define sum2    v30
#define sum3    v31
/**
    Begin code text
**/

FUNC_PROLOG

#if 1
    NOP     /****** alignment may be important ******/
#endif

ENTRY_7( gen_R_sums2, X_bf, Corr0_bf, Corr1_bf, Ptov_map, R_sump0, R_sump1,
         Num_phys_users )
    CMPWI( Num_phys_users, 0 )
    BGT( start )
    RETURN

LABEL( start )
    SAVE_r13_r17
    USE_THRU_v31( VRSAVE_COND )

/**
    DST setup
**/
    SUB( dst_stride, Corr1_bf, Corr0_bf )
    MAKE_STREAM_CODE_IIR( dst_code, DST_BUMP, 2, dst_stride )
    ADDI( dstp, Corr0_bf, 80 )          /* start prefetch advanced */
    /* 48: 1087, 64: 1094, 80: 1043, 96: 1058, 112: 1049, 128: 1061 */
    PREFETCH( dstp, dst_code, 0 )

/**
    Setup for outer loop entry
    Read and expand two corr vectors
    Set outer loop counter condition
**/
    LI( indx1, 16 )
    LI( indx2, 32 )
    LI( indx3, 48 )
    LI( sindex1, 4 )
    CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
    LOAD_CORR( corr00, 0, Corr0_bf )
    LOAD_CORR( corr10, Corr0_bf, indx2 )
    ADDI( Ptov_map, Ptov_map, -PTOV_BUMP_1 )
    LOAD_CORR( corr20, 0, Corr1_bf )
    ADDI( R_sump0, R_sump0, -RSUM_BUMP_8 )
    LOAD_CORR( corr30, Corr1_bf, indx2 )
    LOAD_CORR( corr01, Corr0_bf, indx1 )
    ADDI( R_sump1, R_sump1, -RSUM_BUMP_8 )
    LOAD_CORR( corr11, Corr0_bf, indx3 )
    VXOR( zero, zero, zero )
    LOAD_CORR( corr21, Corr1_bf, indx1 )
    VUPKHSB( C0_0, corr00 )
    ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_64 )
    VUPKHSB( C1_0, corr10 )
    VUPKHSB( C2_0, corr20 )
    VUPKHSB( C3_0, corr30 )
    VUPKLSB( C0_8, corr00 )
    LOAD_CORR( corr31, Corr1_bf, indx3 )    /* corr00, corr31 same register */
    VUPKLSB( C1_8, corr10 )
    ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_64 )

    VUPKHSB( C0_16, corr01 )
    VUPKHSB( C1_16, corr11 )
    VUPKHSB( C2_16, corr21 )
    VUPKHSB( C3_16, corr31 )
    VUPKLSB( C0_24, corr01 )
    VUPKLSB( C1_24, corr11 )
    VUPKLSB( C2_24, corr21 )
    VUPKLSB( C3_24, corr31 )
/**
    Outer loop for each physical user
**/

LABEL( oloop )
/* { */
    DECR( Num_phys_users )
    LBZU( ptov_count, Ptov_map, PTOV_BUMP_1 )
    BEQ_CR( OLOOP_BIT, ret )
    LVX( X0, 0, X_bf )
    LVX( X8, X_bf, indx1 )
    SRWI_C( icount, ptov_count, 1 )
    LVX( X16, X_bf, indx2 )
    LVX( X24, X_bf, indx3 )
    ADDI( X_bf, X_bf, X_BUMP_64 )
    CMPWI_CR( OLOOP_BIT, Num_phys_users, 0 )
    BEQ_MINUS( one_sum )

/**
    Top of sum loop
    Produces four sums each pass
**/
LABEL( iloop )
/* { */
    PREFETCH( dstp, dst_code, 0 )
    LOAD_CORR( corr00, 0, Corr0_bf )
    DECR_C( icount )
    LOAD_CORR( corr10, Corr0_bf, indx2 )
    VMSUMSHS( sum0, C0_0, X0, zero )
    LOAD_CORR( corr20, 0, Corr1_bf )
    VMSUMSHS( sum1, C1_0, X0, zero )
    LOAD_CORR( corr30, Corr1_bf, indx2 )
    LOAD_CORR( corr01, Corr0_bf, indx1 )
    VMSUMSHS( sum2, C2_0, X0, zero )
    LOAD_CORR( corr11, Corr0_bf, indx3 )
    LOAD_CORR( corr21, Corr1_bf, indx1 )
    VMSUMSHS( sum3, C3_0, X0, zero )
    VUPKHSB( C0_0, corr00 )
    VMSUMSHS( sum0, C0_8, X8, sum0 )
    VUPKHSB( C1_0, corr10 )
    ADDI( R_sump0, R_sump0, RSUM_BUMP_8 )
    VUPKHSB( C2_0, corr20 )
    VMSUMSHS( sum1, C1_8, X8, sum1 )
    VUPKHSB( C3_0, corr30 )
    VMSUMSHS( sum2, C2_8, X8, sum2 )
    VUPKLSB( C0_8, corr00 )
    VMSUMSHS( sum3, C3_8, X8, sum3 )
    ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_64 )
    LOAD_CORR( corr31, Corr1_bf, indx3 )    /* corr00, corr31 same register */
    VUPKLSB( C1_8, corr10 )
    VMSUMSHS( sum0, C0_16, X16, sum0 )
    VUPKLSB( C2_8, corr20 )
    VMSUMSHS( sum1, C1_16, X16, sum1 )
    VUPKLSB( C3_8, corr30 )
    ADDI( R_sump1, R_sump1, RSUM_BUMP_8 )
    VUPKHSB( C0_16, corr01 )
    VMSUMSHS( sum2, C2_16, X16, sum2 )
    VUPKHSB( C1_16, corr11 )
    VMSUMSHS( sum3, C3_16, X16, sum3 )
    VUPKHSB( C2_16, corr21 )
    VMSUMSHS( sum0, C0_24, X24, sum0 )
    ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_64 )
    VMSUMSHS( sum1, C1_24, X24, sum1 )
    VUPKHSB( C3_16, corr31 )
    VMSUMSHS( sum2, C2_24, X24, sum2 )
    VUPKLSB( C0_24, corr01 )
    VMSUMSHS( sum3, C3_24, X24, sum3 )
    VSUMSWS( sum0, sum0, zero )
    VUPKLSB( C1_24, corr11 )
    VSUMSWS( sum1, sum1, zero )
    VUPKLSB( C2_24, corr21 )
    VSUMSWS( sum2, sum2, zero )
    VUPKLSB( C3_24, corr31 )
    VSPLTW( sum0, sum0, 3 )
    VSUMSWS( sum3, sum3, zero )
    VSPLTW( sum1, sum1, 3 )
    STVEWX( sum0, 0, R_sump0 )
    VSPLTW( sum2, sum2, 3 )
    STVEWX( sum1, R_sump0, sindex1 )
    VSPLTW( sum3, sum3, 3 )
    STVEWX( sum2, 0, R_sump1 )
    STVEWX( sum3, R_sump1, sindex1 )
/* } */
    BNE( iloop )

/**
    Drop out, check for remainders
**/
    ANDI_C( icount, ptov_count, 0x1 )
    BEQ( oloop )



/**
    One more sum:
    Enters and exits with two corr vectors loaded and expanded to 16 bit
**/

LABEL( one_sum )
    VMSUMSHS( sum0, C0_0, X0, zero )
    ADDI( R_sump0, R_sump0, RSUM_BUMP_8 )
    VMSUMSHS( sum2, C2_0, X0, zero )
    ADDI( R_sump1, R_sump1, RSUM_BUMP_8 )
    VMSUMSHS( sum0, C0_8, X8, sum0 )
    VMSUMSHS( sum2, C2_8, X8, sum2 )
    VMSUMSHS( sum0, C0_16, X16, sum0 )
    VMSUMSHS( sum2, C2_16, X16, sum2 )
    VMSUMSHS( sum0, C0_24, X24, sum0 )
    VMSUMSHS( sum2, C2_24, X24, sum2 )
    VSUMSWS( sum0, sum0, zero )
    VSUMSWS( sum2, sum2, zero )
    VSPLTW( sum0, sum0, 3 )
    STVEWX( sum0, 0, R_sump0 )
    VSPLTW( sum2, sum2, 3 )
    STVEWX( sum2, 0, R_sump1 )
    ADDI( R_sump0, R_sump0, -RSUM_BUMP_4 )  /* pre-dec pointers for loop reentry */
    ADDI( R_sump1, R_sump1, -RSUM_BUMP_4 )

/**
    Setup for loop re-entry: corr00 consumed in one_sum section

    exit ptr v
    corr00 corr10 corr00 corr10

    corr00 corr10 corr00 corr10
    re-entry ptr
**/

    VMR( corr21, corr31 )               /* corr00, corr31 same register */
    VMR( corr00, corr10 )
    LOAD_CORR( corr10, 0, Corr0_bf )
    VMR( corr01, corr11 )
    LOAD_CORR( corr11, Corr0_bf, indx1 )
    VMR( corr20, corr30 )
    LOAD_CORR( corr30, 0, Corr1_bf )
    VUPKHSB( C0_0, corr00 )
    VUPKLSB( C0_8, corr00 )
    LOAD_CORR( corr31, Corr1_bf, indx1 )    /* corr00, corr31 same register */
    VUPKHSB( C1_0, corr10 )
    VUPKLSB( C1_8, corr10 )
    VUPKHSB( C2_0, corr20 )
    VUPKLSB( C2_8, corr20 )
    VUPKHSB( C3_0, corr30 )
    VUPKLSB( C3_8, corr30 )
    VUPKHSB( C0_16, corr01 )
    ADDI( Corr0_bf, Corr0_bf, CORR_BUMP_32 )
    VUPKLSB( C0_24, corr01 )
    ADDI( Corr1_bf, Corr1_bf, CORR_BUMP_32 )
    VUPKHSB( C1_16, corr11 )
    VUPKLSB( C1_24, corr11 )
    VUPKHSB( C2_16, corr21 )
    VUPKLSB( C2_24, corr21 )
    VUPKHSB( C3_16, corr31 )
    VUPKLSB( C3_24, corr31 )
/* } */
    BR( oloop )

/**
    Exit routine
**/

LABEL( ret )
    FREE_THRU_v31( VRSAVE_COND )
    REST_r13_r17
    RETURN

FUNC_EPILOG
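
Each stored sum in the routines above is built from the same short sequence: unpack signed bytes to halfwords (VUPKHSB/VUPKLSB), multiply-sum halfword pairs into four 32-bit lanes (VMSUMSHS), then sum across the lanes (VSUMSWS). The following scalar C sketch models that sequence for a single sum. It is an illustration under stated assumptions: the helper names are hypothetical, the inputs are taken as flat arrays of 32 halfwords and 32 bytes, and register-level details such as the word splat before the store are left out.

```c
#include <stdint.h>

/* Clamp a wide intermediate to the signed 32-bit range (saturation). */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Model of vupkhsb / vupklsb on half a vector: sign-extend 8 bytes. */
static void unpack8(const int8_t src[8], int16_t dst[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = (int16_t)src[i];
}

/* Model of vmsumshs: each 32-bit lane adds two halfword products. */
static void msum_shs(const int16_t a[8], const int16_t b[8], int32_t acc[4])
{
    for (int w = 0; w < 4; w++) {
        int64_t s = (int64_t)acc[w]
                  + (int64_t)a[2 * w] * b[2 * w]
                  + (int64_t)a[2 * w + 1] * b[2 * w + 1];
        acc[w] = sat32(s);
    }
}

/* One R sum: 32 correlation bytes against 32 X halfwords, processed as
   four blocks of eight, followed by a saturating sum across the lanes
   (the vsumsws step with a zero addend). */
int32_t one_r_sum(const int16_t x[32], const int8_t corr[32])
{
    int32_t acc[4] = {0, 0, 0, 0};
    int16_t c[8];

    for (int blk = 0; blk < 4; blk++) {
        unpack8(corr + 8 * blk, c);
        msum_shs(c, x + 8 * blk, acc);
    }
    return sat32((int64_t)acc[0] + acc[1] + acc[2] + acc[3]);
}
```

The real and imaginary parts are never separated in this scheme: because a sum adds real-by-real and imaginary-by-imaginary products, an interleaved complex vector can be treated as a flat run of 32 values.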
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MC Standard Algorithms -- PPC Macro language Version 



File Name:      GEN_X_ROW.MAC

Description:    2 complex scalers (4x1), 2 complex vectors (4xN)
                16 bit complex multiplication producing a 16
                bit complex vector of length 16*N.

Entry/params:   GEN_X_ROW (A1, A2, C, Phys_index, N)

Formula:

    for ( i = 0; i < tot_phys_users; i++ ) {

        in_mpath1p = mpath1_bf + (i * N_FINGERS_MAX);
        in_mpath2p = mpath2_bf + (i * N_FINGERS_MAX);

        for ( q1 = 0; q1 < N_FINGERS_MAX; q1++ ) {

            s1r = (BF32)out_mpath1p[q1].real;
            s1i = (BF32)out_mpath1p[q1].imag;
            s2r = (BF32)out_mpath2p[q1].real;
            s2i = (BF32)out_mpath2p[q1].imag;

            for ( q = 0; q < N_FINGERS_MAX; q++ ) {

                a1r = (BF32)in_mpath1p[q].real;
                a1i = (BF32)in_mpath1p[q].imag;
                a2r = (BF32)in_mpath2p[q].real;
                a2i = (BF32)in_mpath2p[q].imag;

                cr  = (a1r * s1r) + (a1i * s1i);
                ci  = (a1r * s1i) - (a1i * s1r);
                cr += (a2r * s2r) + (a2i * s2i);
                ci += (a2r * s2i) - (a2i * s2r);

                X_bf[i * N_FINGERS_MAX_SQUARED + j].real = (BF16)(cr ...
                X_bf[i * N_FINGERS_MAX_SQUARED + j].imag = (BF16)(ci ...
                ++j;

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date    Engineer    Reason
0.0         000907  fpl         Created



#include "salppc.inc"

#define LOG_N_FINGERS_MAX   2
#define LOG_ELEMENT_SIZE    2
#define INDEX_SHIFT         ( LOG_N_FINGERS_MAX + LOG_ELEMENT_SIZE )

/**
    Local read-only permute vector table
**/

RODATA_SECTION( 6 )
START_L_ARRAY( local_table )
L_PERMUTE_MASK( 0x02031011, 0x06071415, 0x0a0b1819, 0x0e0f1c1d )
/** 32 -> 16 bit: select the 16 MSBs of each 32 bit field **/
L_PERMUTE_MASK( 0x00011011, 0x04051415, 0x08091819, 0x0c0d1c1d )
END_ARRAY
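
The effect of the second permute mask can be checked with a small byte-level model of the AltiVec vperm operation, in which each result byte is selected from the 32-byte concatenation of the two source vectors by the low five bits of the corresponding mask byte. The sketch below is illustrative only; the function names are hypothetical, and vectors are modeled as big-endian byte arrays to match the element order the mask assumes.

```c
#include <stdint.h>

/* Second table entry: 16 MSBs of each word of a, interleaved with the
   16 MSBs of each word of b. */
static const uint8_t MSB16_MASK[16] = {
    0x00, 0x01, 0x10, 0x11, 0x04, 0x05, 0x14, 0x15,
    0x08, 0x09, 0x18, 0x19, 0x0c, 0x0d, 0x1c, 0x1d
};

/* Byte-level model of vperm: out[i] = (a || b)[mask[i] & 0x1f]. */
void vperm_bytes(const uint8_t a[16], const uint8_t b[16],
                 const uint8_t mask[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        uint8_t sel = mask[i] & 0x1f;
        out[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
}

/* Store a 32-bit value most significant byte first. */
static void store_be32(int32_t v, uint8_t *p)
{
    uint32_t u = (uint32_t)v;
    p[0] = (uint8_t)(u >> 24);
    p[1] = (uint8_t)(u >> 16);
    p[2] = (uint8_t)(u >> 8);
    p[3] = (uint8_t)u;
}

/* Four 32-bit real and four 32-bit imaginary results in, eight 16-bit
   values out, ordered re0, im0, re1, im1, ... */
void take_msb16(const int32_t cr[4], const int32_t ci[4], int16_t out[8])
{
    uint8_t a[16], b[16], r[16];

    for (int k = 0; k < 4; k++) {
        store_be32(cr[k], a + 4 * k);
        store_be32(ci[k], b + 4 * k);
    }
    vperm_bytes(a, b, MSB16_MASK, r);
    for (int h = 0; h < 8; h++)
        out[h] = (int16_t)(uint16_t)(((unsigned)r[2 * h] << 8) | r[2 * h + 1]);
}
```

Selecting the upper halfword is equivalent to an arithmetic shift right by 16, so the permute performs the 32-bit to 16-bit narrowing and the real/imaginary interleave in one step.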
/**
    API registers
**/

#define A1          r3
#define A2          r4
#define C           r5
#define Phys_index  r6
#define N           r7

/**
    Integer loop registers
**/

#define Cp0     C
#define Cp1     r8
#define sptr1   r8
#define Cp2     r9
#define sptr2   r9
#define Cp3     r10
#define tptr    r10
#define cindex  r11
#define aindex  r12
#define index   r12

/**
    G4 registers
**/

#define cr00    v0
#define cr01    v1
#define cr02    v2
#define cr03    v3

#define vtmp0   v0
#define vtmp2   v2

#define ci00    v4
#define ci01    v5
#define ci02    v6
#define ci03    v7

#define sr00    v8
#define sr01    v9
#define sr02    v10
#define sr03    v11

#define si00    v12
#define si01    v13
#define si02    v14
#define si03    v15

#define sr10    v16
#define sr11    v17
#define sr12    v18
#define sr13    v19

#define si10    v20
#define si11    v21
#define si12    v22
#define si13    v23

#define c0      v24
#define c1      v24
#define c2      v25
#define c3      v26

#define a00     v27
#define a01     v28
#define a11     v29

#define sval        v28
#define neg_sval    v29
#define vc          v30
#define zero        v31

/**
    Begin code text
**/

FUNC_PROLOG
ENTRY_5( gen_X_row, A1, A2, C, Phys_index, N )
    USE_THRU_v31( VRSAVE_COND )
/**
    Load up complex scaler
    sval = sr0 si0 sr1 si1 sr2 si2 sr3 si3
**/
    LA( tptr, local_table, 0 )
    VXOR( zero, zero, zero )
    LI( index, 0 )

/**
    Byte offset into 16 bit complex vector
**/
    SLWI( Phys_index, Phys_index, INDEX_SHIFT )
    ADD( sptr1, A1, Phys_index )
    ADD( sptr2, A2, Phys_index )

/**
    Load up first scaler:
    if sval = sr0,si0  sr1,si1  sr2,si2  sr3,si3
            = s0       s1       s2       s3
**/

    LVX( sval, sptr1, index )           /* read 4 16 bit complex values */
    VSUBSHS( neg_sval, zero, sval )     /* negate complex scaler values */
    VMRGHW( vtmp0, sval, sval )         /* vtmp0 = s0 s0 s1 s1 */
    VMRGLW( vtmp2, sval, sval )         /* vtmp2 = s2 s2 s3 s3 */
    VMRGHW( sr00, vtmp0, vtmp0 )        /* sr0 = s0 s0 s0 s0 */
    VMRGLW( sr01, vtmp0, vtmp0 )        /* sr1 = s1 s1 s1 s1 */
    VMRGHW( sr02, vtmp2, vtmp2 )        /* sr2 = s2 s2 s2 s2 */
    VMRGLW( sr03, vtmp2, vtmp2 )        /* sr3 = s3 s3 s3 s3 */

/**
    if neg_sval = sr0,si0  sr1,si1  sr2,si2  sr3,si3
    after perm:
                = si0,-sr0  si1,-sr1  si2,-sr2  si3,-sr3
                = ns0       ns1       ns2       ns3
**/

    LVX( vc, tptr, index )
    VPERM( neg_sval, sval, neg_sval, vc )   /* si -sr */
    VMRGHW( vtmp0, neg_sval, neg_sval )     /* vtmp0 = ns0 ns0 ns1 ns1 */
    VMRGLW( vtmp2, neg_sval, neg_sval )     /* vtmp2 = ns2 ns2 ns3 ns3 */
    VMRGHW( si00, vtmp0, vtmp0 )            /* si0 = ns0 ns0 ns0 ns0 */
    VMRGLW( si01, vtmp0, vtmp0 )            /* si1 = ns1 ns1 ns1 ns1 */
    VMRGHW( si02, vtmp2, vtmp2 )            /* si2 = ns2 ns2 ns2 ns2 */
    VMRGLW( si03, vtmp2, vtmp2 )            /* si3 = ns3 ns3 ns3 ns3 */
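
The purpose of the two replicated scaler forms, (sr, si) and (si, -sr), becomes clear when one 32-bit lane of the later multiply-sum is written out in scalar C: pairing each form with an input (ar, ai) gives the real and imaginary parts of the header formula directly. The sketch below is illustrative only; the type and function names are hypothetical, and it models a single lane with a zero accumulator.

```c
#include <stdint.h>

typedef struct { int16_t real, imag; } cplx16;  /* assumed 16-bit complex */
typedef struct { int32_t real, imag; } cplx32;

/* Saturating negation, as performed by subtracting from zero with vsubshs. */
static int16_t neg_sat16(int16_t v)
{
    return (v == INT16_MIN) ? INT16_MAX : (int16_t)-v;
}

/* One lane of vmsumshs with a zero accumulator: the saturated sum of two
   halfword products. */
static int32_t pair_msum(const int16_t p[2], const int16_t q[2])
{
    int64_t s = (int64_t)p[0] * q[0] + (int64_t)p[1] * q[1];
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* cr = ar*sr + ai*si and ci = ar*si - ai*sr, computed as two pair sums. */
cplx32 scale_by_pairs(cplx16 a, cplx16 s)
{
    const int16_t a_pair[2]  = { a.real, a.imag };
    const int16_t sr_pair[2] = { s.real, s.imag };              /* sr, si  */
    const int16_t si_pair[2] = { s.imag, neg_sat16(s.real) };   /* si, -sr */
    cplx32 c;

    c.real = pair_msum(sr_pair, a_pair);
    c.imag = pair_msum(si_pair, a_pair);
    return c;
}
```

For example, a = 3 + 4i and s = 5 + 6i give cr = 39 and ci = -2, which is the product of conj(a) and s.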

/**
Load up second scaler:
**/

LVX( sval, sptr2, index )              /* read 4 16 bit complex values */
ADDI( index, index, 16 )
VSUBSHS( neg_sval, zero, sval )        /* negate complex scaler values */
VMRGHW( vtmp0, sval, sval )            /* vtmp0 = s0 s0 s1 s1 */
VMRGLW( vtmp2, sval, sval )            /* vtmp2 = s2 s2 s3 s3 */
VMRGHW( sr10, vtmp0, vtmp0 )           /* sr0 = s0 s0 s0 s0 */
VMRGLW( sr11, vtmp0, vtmp0 )           /* sr1 = s1 s1 s1 s1 */
VMRGHW( sr12, vtmp2, vtmp2 )           /* sr2 = s2 s2 s2 s2 */
VMRGLW( sr13, vtmp2, vtmp2 )           /* sr3 = s3 s3 s3 s3 */

VPERM( neg_sval, sval, neg_sval, vc )  /* si -sr */
VMRGHW( vtmp0, neg_sval, neg_sval )    /* vtmp0 = ns0 ns0 ns1 ns1 */
VMRGLW( vtmp2, neg_sval, neg_sval )    /* vtmp2 = ns2 ns2 ns3 ns3 */
VMRGHW( si10, vtmp0, vtmp0 )           /* ns0 ns0 ns0 ns0 */
VMRGLW( si11, vtmp0, vtmp0 )           /* ns1 ns1 ns1 ns1 */
VMRGHW( si12, vtmp2, vtmp2 )           /* ns2 ns2 ns2 ns2 */
VMRGLW( si13, vtmp2, vtmp2 )           /* ns3 ns3 ns3 ns3 */

/**
Assign loop pointers and index registers:
Loop permute control vector assumes 16 bit input vectors
C[] -> 16 x N complex elements
A[] -> 4 x M complex elements
N -> 4 byte (i.e. interleaved complex) elements
**/



LVX( vc, tptr, index )                 /* interleaves 16 MSBs of real, imaginary */
LI( aindex, 0 )
LI( cindex, 0 )
ADDI( Cp1, C, 16 )
ADDI( Cp2, C, 32 )
ADDI( Cp3, C, 48 )

/**
Start up loop code:
Each read on A[] brings in 4 complex input values
**/



LVX( a00, A1, aindex )
DECR_C( N )
LVX( a01, A2, aindex )
ADDI( aindex, aindex, 16 )

VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VMSUMSHS( cr01, sr01, a00, zero )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
BEQ( dol )
DECR_C( N )
LVX( a10, A1, aindex )                 /* read input for next pass */
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
LVX( a11, A2, aindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
BR( mid_loop0 )








/**
Top of double loop
**/

LABEL( loop0 )
/* { */
VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a00, zero )
DECR_C( N )

VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )

LVX( a10, A1, aindex )                 /* read input for next pass */
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
LVX( a11, A2, aindex )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
ADDI( cindex, cindex, 64 )
LABEL( mid_loop0 )

VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc )            /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )
/* } */

BNE( loop1 )

/**
Drop out to flush
**/

VMSUMSHS( cr00, sr00, a10, zero )
VMSUMSHS( ci00, si00, a10, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a10, zero )
VMSUMSHS( ci01, si01, a10, zero )
VMSUMSHS( cr02, sr02, a10, zero )
VMSUMSHS( ci02, si02, a10, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a10, zero )
VMSUMSHS( ci03, si03, a10, zero )
VMSUMSHS( cr00, sr10, a11, cr00 )
VMSUMSHS( ci00, si10, a11, ci00 )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a11, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a11, ci01 )
VMSUMSHS( cr02, sr12, a11, cr02 )
VPERM( c0, cr00, ci00, vc )            /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a11, ci02 )
VMSUMSHS( cr03, sr13, a11, cr03 )
VMSUMSHS( ci03, si13, a11, ci03 )
VPERM( c1, cr01, ci01, vc )

VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
BR( ret )

/**
Top of second loop
**/

LABEL( loop1 )
/* { */
VMSUMSHS( cr00, sr00, a10, zero )
VMSUMSHS( ci00, si00, a10, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a10, zero )
DECR_C( N )

VMSUMSHS( ci01, si01, a10, zero )
VMSUMSHS( cr02, sr02, a10, zero )
VMSUMSHS( ci02, si02, a10, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a10, zero )
VMSUMSHS( ci03, si03, a10, zero )

LVX( a00, A1, aindex )                 /* read input for next pass */
VMSUMSHS( cr00, sr10, a11, cr00 )
VMSUMSHS( ci00, si10, a11, ci00 )
LVX( a01, A2, aindex )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a11, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a11, ci01 )
VMSUMSHS( cr02, sr12, a11, cr02 )
VPERM( c0, cr00, ci00, vc )            /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a11, ci02 )
ADDI( aindex, aindex, 16 )
VMSUMSHS( cr03, sr13, a11, cr03 )
VMSUMSHS( ci03, si13, a11, ci03 )
VPERM( c1, cr01, ci01, vc )
/* } */

BNE( loop0 )
/**
Flush loop
**/

VMSUMSHS( cr00, sr00, a00, zero )
VMSUMSHS( ci00, si00, a00, zero )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VMSUMSHS( cr01, sr01, a00, zero )
VMSUMSHS( ci01, si01, a00, zero )
VMSUMSHS( cr02, sr02, a00, zero )
VMSUMSHS( ci02, si02, a00, zero )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
VMSUMSHS( cr03, sr03, a00, zero )
VMSUMSHS( ci03, si03, a00, zero )
VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
STVX( c3, Cp3, cindex )
VMSUMSHS( cr01, sr11, a01, cr01 )
ADDI( cindex, cindex, 64 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )

VPERM( c0, cr00, ci00, vc )            /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )

VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )
BR( ret )
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LABEL ( dol ) 

VMSUMSHS( cr00, sr10, a01, cr00 )
VMSUMSHS( ci00, si10, a01, ci00 )
VMSUMSHS( cr01, sr11, a01, cr01 )
VMSUMSHS( ci01, si11, a01, ci01 )
VMSUMSHS( cr02, sr12, a01, cr02 )
VPERM( c0, cr00, ci00, vc )            /* begin permute cycle for this pass */
STVX( c0, Cp0, cindex )                /* begin write cycle from last pass */
VMSUMSHS( ci02, si12, a01, ci02 )
VMSUMSHS( cr03, sr13, a01, cr03 )
VMSUMSHS( ci03, si13, a01, ci03 )
VPERM( c1, cr01, ci01, vc )
VPERM( c2, cr02, ci02, vc )
STVX( c1, Cp1, cindex )
VPERM( c3, cr03, ci03, vc )
STVX( c2, Cp2, cindex )
STVX( c3, Cp3, cindex )

/**
Return
**/

LABEL( ret )
FREE_THRU_v31( VRSAVE_COND )
RETURN
FUNC_EPILOG
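
The arithmetic performed by each VMSUMSHS pair in the listing above can be sketched in scalar C. This is only an illustrative model: it assumes the listing's comments are accurate, i.e. that the "sr" registers hold (sr, si) halfword pairs and the "si" registers hold the permuted pairs (si, -sr). The type and function names (cplx16, cplx32, msum_lane, cmac_conj) are hypothetical and do not appear in the original source.

```c
#include <stdint.h>

/* Clamp a wide intermediate to the signed 32-bit range. */
static int32_t sat32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Model of one 32-bit lane of VMSUMSHS: two signed 16-bit products
   are added to a 32-bit accumulator and the result is saturated. */
static int32_t msum_lane(int16_t x0, int16_t x1,
                         int16_t y0, int16_t y1, int32_t acc)
{
    int64_t sum = (int64_t)x0 * y0 + (int64_t)x1 * y1 + (int64_t)acc;
    return sat32(sum);
}

typedef struct { int16_t re, im; } cplx16;  /* hypothetical 16-bit complex */
typedef struct { int32_t re, im; } cplx32;  /* hypothetical 32-bit complex */

/* acc + s * conj(a), built from two lane operations:
   real lane uses (sr, si), imaginary lane uses (si, -sr). */
static cplx32 cmac_conj(cplx16 s, cplx16 a, cplx32 acc)
{
    cplx32 c;
    /* The negation is saturating, like VSUBSHS( neg_sval, zero, sval ). */
    int16_t neg_re = (s.re == INT16_MIN) ? INT16_MAX : (int16_t)(-s.re);

    c.re = msum_lane(s.re, s.im, a.re, a.im, acc.re);
    c.im = msum_lane(s.im, neg_re, a.re, a.im, acc.im);
    return c;
}
```

Under these assumptions, the first VMSUMSHS of each pair (accumulator "zero") corresponds to cmac_conj with a zero accumulator for the A1 input, and the second (accumulator cr/ci) adds the contribution of the second scaler and the A2 input.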
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#include "raudlib.h" 

/*
 * Return the offset in units of complex elements into the Corr0 matrix
 * corresponding to a specified starting physical user and starting virtual
 * user (within the starting physical user) pair.
 */
int mudlib_get_Corr0_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
)
{
    int num_Corrs, num_virt_users;

    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    num_Corrs = (num_virt_users * tot_virt_users) -
                ((num_virt_users * (num_virt_users + 1)) / 2);

    return ( num_Corrs * (num_fingers * num_fingers) );
}
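
The closed-form expression in the function above is consistent with a packed, strictly upper-triangular layout in which row k holds (tot_virt_users - 1 - k) blocks of num_fingers x num_fingers elements. The following sketch checks that reading by comparing the closed form with a row-by-row sum. The helper names are hypothetical, and the layout interpretation is an assumption rather than something the listing states.

```c
/* Closed form, as in the listing: n_before is the number of virtual
   users that precede the starting (physical, virtual) pair. */
static int corr0_offset_closed_form(int n_before, int tot_virt_users,
                                    int num_fingers)
{
    int num_corrs = (n_before * tot_virt_users) -
                    ((n_before * (n_before + 1)) / 2);
    return num_corrs * (num_fingers * num_fingers);
}

/* Same quantity computed row by row, assuming row k of a strictly
   upper-triangular matrix stores (tot_virt_users - 1 - k) blocks. */
static int corr0_offset_by_rows(int n_before, int tot_virt_users,
                                int num_fingers)
{
    int k, num_corrs = 0;
    for (k = 0; k < n_before; k++)
        num_corrs += tot_virt_users - 1 - k;
    return num_corrs * (num_fingers * num_fingers);
}
```

For example, with 4 virtual users and 4 fingers, skipping two rows passes 3 + 2 = 5 blocks of 16 elements.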

/*
 * Return the size (in bytes) of the portion of the Corr0 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of Corr0 are assumed
 * to be of type COMPLEX_BF8.
 */
int mudlib_get_Corr0_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_Corr0_offset( ptov_map,
                                            num_fingers,
                                            tot_virt_users,
                                            start_phys_user,
                                            start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_Corr0_offset( ptov_map,
                                          num_fingers,
                                          tot_virt_users,
                                          end_phys_user,
                                          end_virt_user );

    return ( (end_offset - start_offset) * sizeof(COMPLEX_BF8) );
}

/*
 * Return the offset in units of complex elements into the Corr1 matrix
 * corresponding to a specified starting physical user and starting virtual
 * user (within the starting physical user) pair.
 */
int mudlib_get_Corr1_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
)
{
    int num_Corrs, num_virt_users;

    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    num_Corrs = (num_virt_users * tot_virt_users);

    return ( num_Corrs * (num_fingers * num_fingers) );
}

/*
 * Return the size (in bytes) of the portion of the Corr1 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of Corr1 are assumed
 * to be of type COMPLEX_BF8.
 */
int mudlib_get_Corr1_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int num_fingers,           /* typically, 4 */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_Corr1_offset( ptov_map,
                                            num_fingers,
                                            tot_virt_users,
                                            start_phys_user,
                                            start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_Corr1_offset( ptov_map,
                                          num_fingers,
                                          tot_virt_users,
                                          end_phys_user,
                                          end_virt_user );

    return ( (end_offset - start_offset) * sizeof(COMPLEX_BF8) );
}

/*
 * Return the offset into the R0 matrix corresponding to a specified
 * starting physical user and starting virtual user (within the
 * starting physical user) pair.
 */
int mudlib_get_R0_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
)
{
    int i, num_virt_users, offset, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    offset = 0;
    for ( i = 0; i < num_virt_users; i++ )
        offset += (tcols - (i & ~R_MATRIX_ALIGN_MASK));

    return offset;
}
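
The loop above can be illustrated with concrete numbers. The sketch below assumes, purely for illustration, an alignment mask of 31 (32-element alignment); the actual value of R_MATRIX_ALIGN_MASK is defined in a header that is not part of this listing. The names ALIGN_MASK_ASSUMED, round_up_cols and r0_offset are hypothetical.

```c
/* Assumed value for illustration only; the real R_MATRIX_ALIGN_MASK
   is not shown in the listing. */
#define ALIGN_MASK_ASSUMED 31

/* Round a column count up to the next multiple of the alignment. */
static int round_up_cols(int n)
{
    return (n + ALIGN_MASK_ASSUMED) & ~ALIGN_MASK_ASSUMED;
}

/* Offset of row rows_before in a packed upper-triangular matrix whose
   rows start on aligned columns: row i holds tcols minus i rounded
   down to the alignment. */
static int r0_offset(int rows_before, int tot_virt_users)
{
    int tcols = round_up_cols(tot_virt_users);
    int i, offset = 0;

    for (i = 0; i < rows_before; i++)
        offset += tcols - (i & ~ALIGN_MASK_ASSUMED);
    return offset;
}
```

With 40 virtual users the padded row length is 64, so rows 0 through 31 each occupy 64 elements and row 32 occupies 32, giving an offset of 2080 for row 33.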

/*
 * Return the size (in bytes) of the portion of the R0 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of R0 are assumed
 * to be of type BF8.
 */
int mudlib_get_R0_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_R0_offset( ptov_map,
                                         tot_virt_users,
                                         start_phys_user,
                                         start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_R0_offset( ptov_map,
                                       tot_virt_users,
                                       end_phys_user,
                                       end_virt_user );

    return ( (end_offset - start_offset) * sizeof(BF8) );
}

/*
 * Return the offset into the R1 matrix corresponding to a specified
 * starting physical user and starting virtual user (within the
 * starting physical user) pair.
 */
int mudlib_get_R1_offset(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
)
{
    int num_virt_users, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    num_virt_users = mudlib_get_num_virt_users( ptov_map, 0, 0,
                                                start_phys_user,
                                                start_virt_user ) - 1;

    return ( num_virt_users * tcols );
}

/*
 * Return the size (in bytes) of the portion of the R1 matrix
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive. Elements of R1 are assumed
 * to be of type BF8.
 */
int mudlib_get_R1_size(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
)
{
    int start_offset, end_offset;

    start_offset = mudlib_get_R1_offset( ptov_map,
                                         tot_virt_users,
                                         start_phys_user,
                                         start_virt_user );

    MUDLIB_INCR_VIRT_USER( ptov_map, end_phys_user, end_virt_user )

    end_offset = mudlib_get_R1_offset( ptov_map,
                                       tot_virt_users,
                                       end_phys_user,
                                       end_virt_user );

    return ( (end_offset - start_offset) * sizeof(BF8) );
}

/*
 * Return the number of virtual users
 * corresponding to a specified starting physical user, virtual
 * user (within the starting physical user) pair and an ending physical
 * user, virtual user pair, inclusive.
 */
int mudlib_get_num_virt_users(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
)
{
    int i, num_virt_users;

    if ( start_phys_user == end_phys_user )
        return ( end_virt_user - start_virt_user + 1 );
    else {
        num_virt_users = ptov_map[start_phys_user] - start_virt_user;
        for ( i = (start_phys_user + 1); i < end_phys_user; i++ )
            num_virt_users += ptov_map[i];
        num_virt_users += (end_virt_user + 1);
        return ( num_virt_users );
    }
}
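
As a worked example of the inclusive count computed above, the following stand-alone sketch applies the same three-part sum (leading, middle and trailing virtual users) to a small map. The function name count_virt_users and the sample map are hypothetical and used only for illustration.

```c
/* Inclusive count of virtual users between (sp, sv) and (ep, ev),
   where ptov_map[i] is the number of virtual users of physical user i. */
static int count_virt_users(const unsigned char *ptov_map,
                            int sp, int sv, int ep, int ev)
{
    int i, n;

    if (sp == ep)
        return ev - sv + 1;

    n = ptov_map[sp] - sv;            /* leading: rest of first phys user */
    for (i = sp + 1; i < ep; i++)     /* middle: whole phys users */
        n += ptov_map[i];
    n += ev + 1;                      /* trailing: part of last phys user */
    return n;
}
```

With a map of {2, 3, 1}, the range from (0, 1) to (1, 1) covers one virtual user of physical user 0 and two of physical user 1, three in total.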



/*
 * For a specified starting physical user, virtual user
 * (within the starting physical user) pair and a specified
 * number of virtual users inclusive of the starting pair,
 * return (in separate arguments), the corresponding ending
 * physical user, virtual user pair (inclusive).
 */
void mudlib_get_end_user_pair(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int num_virt_users,        /* number from start (must be > 0) */
    int *end_phys_user,        /* zero-based index into ptov_map */
    int *end_virt_user         /* will be < ptov_map[*end_phys_user] */
)
{
    int i, j;

    for ( i = start_phys_user; ; i++ ) {
        for ( j = start_virt_user; j < ptov_map[i]; j++ )
            if ( --num_virt_users == 0 ) break;
        if ( num_virt_users == 0 ) break;
        start_virt_user = 0;
    }

    *end_phys_user = i;
    *end_virt_user = j;
}
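
The walk from a starting pair to an ending pair can also be written as a single stepping loop. The sketch below is an alternative formulation for illustration, not the listing's code: it advances one virtual user at a time and skips physical users that have no virtual users. The function name end_user_pair is hypothetical, and the caller is assumed to request no more virtual users than the map contains.

```c
/* Advance (n - 1) virtual users from (sp, sv) and report where the
   walk ends. Assumes n > 0 and that enough virtual users exist. */
static void end_user_pair(const unsigned char *ptov_map,
                          int sp, int sv, int n,
                          int *ep, int *ev)
{
    int i = sp, j = sv;

    while (n > 1) {
        j++;
        while (j >= ptov_map[i]) {    /* roll over to next phys user */
            i++;
            j = 0;
        }
        n--;
    }
    *ep = i;
    *ev = j;
}
```

With a map of {2, 3, 1}, three virtual users starting at (0, 1) are (0, 1), (1, 0) and (1, 1), so the ending pair is (1, 1).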



5 
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int start phys user, /* zero-based index into ptov map */ 

int start_virt_user /* must be < ptov_map [start_phys_user] 

*/ 

) 

{ 

    int i, iv;
    int R0_skipped_virt_users, R0_tcols, tcols, size;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    R0_skipped_virt_users = 0;
    size = 0;

    for ( i = 0; i < start_phys_user; i++ ) {
        for ( iv = 0; iv < (int) ptov_map[i]; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            ++R0_skipped_virt_users;
        }
    }

    /* Handle last physical user, potentially split on virt users */
    for ( iv = 0; iv < (int) start_virt_user; iv++ ) {
        R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
        size += R0_tcols;
        ++R0_skipped_virt_users;
    }

    return size;
}

int mudlib_get_R0_size_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
)
{
    int i, iv;
    int R0_skipped_virt_users, R0_tcols, tcols, size;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    R0_skipped_virt_users = 0;
    for ( i = 0; i < start_phys_user; i++ )
        R0_skipped_virt_users += (int) ptov_map[i];
    R0_skipped_virt_users += start_virt_user;
    // printf( "skipped: %d\n", R0_skipped_virt_users );

    size = 0;

    if ( start_phys_user == end_phys_user ) {
        // printf( "start == end phys\n" );
        // <= for inclusive
        for ( iv = start_virt_user; iv <= (int) end_virt_user; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            // printf( "size: %d, R0tc: %d\n", size, R0_tcols );
            ++R0_skipped_virt_users;
        }
    }
    else {
        for ( i = start_phys_user; i < end_phys_user; i++ ) {
            for ( iv = 0; iv < (int) ptov_map[i]; iv++ ) {
                R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
                size += R0_tcols;
                // printf( "size: %d, R0tc: %d\n", size, R0_tcols );
                ++R0_skipped_virt_users;
            }
        }

        /* Handle last physical user, potentially split on virt users */
        // printf( "last phys user\n" );
        // <= for inclusive
        for ( iv = start_virt_user; iv <= (int) end_virt_user; iv++ ) {
            R0_tcols = tcols - (R0_skipped_virt_users & ~R_MATRIX_ALIGN_MASK);
            size += R0_tcols;
            // printf( "size: %d, R0tc: %d\n", size, R0_tcols );
            ++R0_skipped_virt_users;
        }
    }

    return size;
}



int mudlib_get_R1_offset_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user        /* must be < ptov_map[start_phys_user] */
)
{
    int i, tcols, virt_users;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    virt_users = 0;

    // Main loop
    for ( i = 0; i < start_phys_user; i++ ) {
        virt_users += (int) ptov_map[i];
    }

    // Trailing virtual users
    virt_users += start_virt_user;

    return ( virt_users * tcols );
}
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ru 



int mudlib_get_R1_size_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
)
{
    int i, tcols, virt_users;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    virt_users = 0;

    if ( start_phys_user == end_phys_user ) {
        virt_users = end_virt_user - start_virt_user + 1;
    }
    else if ( start_phys_user < end_phys_user ) {
        // Leading virtual users
        virt_users = (int) ptov_map[start_phys_user] - start_virt_user;

        // Main loop
        for ( i = (start_phys_user + 1); i < end_phys_user; i++ )
            virt_users += (int) ptov_map[i];

        // Trailing virtual users
        virt_users += (end_virt_user + 1);
    }

    return ( virt_users * tcols );
}
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// 

// Asynchronous MPIC 
// 

#if TIME 

#include <tmr.h> 
#endif 

#include "mudlib.h" 

void sve3_8bit( BF8 *A, BF8 *B, BF8 *C, BF32 *sum, int n );

void dotpr3_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

void dotpr6_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

void dotpr9_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols );

#if TIME
static int time_count = 0;
static int z;
static float time;
static TMR_ts time0, time1;
static TMR_timespec elapsed;

#endif 

/*
 * void async_multirate_mpic( BF8 *Bt_hat, BF8 *R0_hat,
 *                            BF8 *R1_hat, BF8 *R1m_hat,
 *                            BF32 *Y, BF32 Ythresh,
 *                            int N_users, int N_bits, int N_stages )
 *
 * N_users must be
 * N_bits must be >
 */

void mudlib_mpic( BF8 *Bt_hat,
                  BF8 *R0_hat,
                  BF8 *R1_hat,
                  BF8 *R1m_hat,
                  BF32 *Y,
                  BF32 Ythresh,
                  int N_users,
                  int N_bits,
                  int N_stages )
{
    BF8 *Bt_hatp;
    BF8 *R0_hatp, *R1_hatp, *R1m_hatp;
    BF32 *Yp;
    BF32 R_bias, sums[3];
    int hat_tc, i, m, N_users_pad, stage;

    hat_tc = (N_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    N_users_pad = (N_users + ALTIVEC_ALIGN_MASK) & ~ALTIVEC_ALIGN_MASK;

#if 0
    if ( ((long)Bt_hat | (long)R0_upper_bf | (long)R0_lower_bf |
          (long)R1_trans_bf | (long)R1m_bf) & ALTIVEC_ALIGN_MASK ) {
        printf( "***** inputs are NON-ALIGNED *****\n" );
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} 

#endif 

// 

// Subtract interference in N_stages 
// 

    for ( stage = 0; stage < N_stages; stage++ ) {

        R0_hatp = R0_hat;
        R1_hatp = R1_hat;
        R1m_hatp = R1m_hat;
        Yp = Y;

        for ( i = 0; i < N_users; i++ ) {

            sve3_8bit( R0_hatp, R1_hatp, R1m_hatp, &R_bias, N_users_pad );

#if 0
            R0_hatp[i] = BF8_ZERO;   /* zero diagonal element */
#endif

            Bt_hatp = Bt_hat + hat_tc;   /* points to leading row */
            m = 2;

            while ( m < (N_bits-4) ) {
                if ( BFABS( Yp[m] ) < Ythresh ) {
                    if ( BFABS( Yp[m+1] ) < Ythresh ) {
                        if ( BFABS( Yp[m+2] ) < Ythresh ) {
                            dotpr9_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                         sums, N_users_pad, hat_tc );
                            sums[0] -= R_bias;
                            sums[1] -= ( (BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i] );
                            if ( (Yp[m] - sums[0]) > BF32_ZERO )
                                Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                            sums[1] += ( (BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i] );

                            sums[1] -= R_bias;
                            sums[2] -= ( (BF32)Bt_hatp[2*hat_tc + i] * (BF32)R1_hatp[i] );
                            if ( (Yp[m+1] - sums[1]) > BF32_ZERO )
                                Bt_hatp[2*hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[2*hat_tc + i] = -1 + BIAS_8BIT;
                            sums[2] += ( (BF32)Bt_hatp[2*hat_tc + i] * (BF32)R1_hatp[i] );

                            sums[2] -= R_bias;
                            if ( (Yp[m+2] - sums[2]) > BF32_ZERO )
                                Bt_hatp[3*hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[3*hat_tc + i] = -1 + BIAS_8BIT;

                        else {   /* skip third sum */
                            dotpr6_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                         sums, N_users_pad, hat_tc );
                            sums[0] -= R_bias;
                            sums[1] -= ( (BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i] );
                            if ( (Yp[m] - sums[0]) > BF32_ZERO )
                                Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                            else
                                Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;
                            sums[1] += ( (BF32)Bt_hatp[hat_tc + i] * (BF32)R1_hatp[i] );

                            sums[1] -= R_bias;
                            if ( (Yp[m+1] - sums[1]) > BF32_ZERO )
                                Bt_hatp[2*hat_tc + i] = 1 + BIAS_8BIT;
                            else
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#if 10 
#endif 



#if 10 
#endif 



Bt_hatp[2*hat_tc + i] = -1 + BIAS_8BIT; 



bump leading row pointer */ 
bump row */ 



                    else {   /* skip second sum */
                        dotpr3_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                     sums, N_users_pad, hat_tc );
                        sums[0] -= R_bias;
                        if ( (Yp[m] - sums[0]) > BF32_ZERO )
                            Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                        else
                            Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;



Bt_hatp • 



hat_tc ; 



/* bump leading row pointer */ 
/* bump row */ 



Bt_hatp += hat_tc; 



bump leading row pointer */ 
bump row */ 



do last 0, 1 or 2 dot product calculations 



while ( n 

                if ( BFABS( Yp[m] ) < Ythresh ) {
                    dotpr3_8bit( Bt_hatp, R1_hatp, R0_hatp, R1m_hatp,
                                 sums, N_users_pad, hat_tc );
                    sums[0] -= R_bias;
                    if ( (Yp[m] - sums[0]) > BF32_ZERO )
                        Bt_hatp[hat_tc + i] = 1 + BIAS_8BIT;
                    else
                        Bt_hatp[hat_tc + i] = -1 + BIAS_8BIT;



#if 10 
#endif 



} 

Bt_hatp • 
+ +m; 



/* bump leading : 



#endif 
, } 



            R0_hatp += hat_tc;    /* bump pointer */
            R1_hatp += hat_tc;    /* bump pointer */
            R1m_hatp += hat_tc;   /* bump pointer */
            Yp += N_bits;         /* bump pointer */

        }   /* end of loop over N_users */
    }   /* end of loop over N_stages */
}



#if defined( COMPILE_C )

void dotpr3_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int j;

    sums[0] = BF32_ZERO;

    for ( j = 0; j < N; j++ ) {
        sums[0] += (BF32)A[j] * (BF32)B0[j];
        sums[0] += (BF32)A[tcols+j] * (BF32)B1[j];
        sums[0] += (BF32)A[(tcols<<1)+j] * (BF32)B2[j];
    }
}

void dotpr6_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int i, j;

    for ( i = 0; i < 2; i++ ) {
        sums[i] = BF32_ZERO;
        for ( j = 0; j < N; j++ ) {
            sums[i] += (BF32)A[i*tcols + j] * (BF32)B0[j];
            sums[i] += (BF32)A[(i+1)*tcols + j] * (BF32)B1[j];
            sums[i] += (BF32)A[(i+2)*tcols + j] * (BF32)B2[j];
        }
    }
}

void dotpr9_8bit( BF8 *A, BF8 *B0, BF8 *B1, BF8 *B2,
                  BF32 *sums, int N, int tcols )
{
    int i, j;

    for ( i = 0; i < 3; i++ ) {
        sums[i] = BF32_ZERO;
        for ( j = 0; j < N; j++ ) {
            sums[i] += (BF32)A[i*tcols + j] * (BF32)B0[j];
            sums[i] += (BF32)A[(i+1)*tcols + j] * (BF32)B1[j];
            sums[i] += (BF32)A[(i+2)*tcols + j] * (BF32)B2[j];
        }
    }
}

void sve3_8bit( BF8 *A, BF8 *B, BF8 *C, BF32 *sum, int n )
{
    int i;
    BF32 wsum;

    wsum = 0;

    for ( i = 0; i < n; i++ ) {
        wsum += (BF32)A[i];
        wsum += (BF32)B[i];
        wsum += (BF32)C[i];
    }

    *sum = wsum;
}

#endif
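
To illustrate how the reference dot product feeds the bit decision in the MPIC loop, the sketch below pairs a three-row dot product with a sign test. It is a simplified model under stated assumptions: BF8 and BF32 are taken to be signed 8-bit and 32-bit integers, and the BIAS_8BIT offset and R_bias correction used in the listing are left out. The names dot3_rows and hard_decision are hypothetical.

```c
#include <stdint.h>

/* Sum of three row-by-vector products, with consecutive rows of A
   spaced tcols elements apart (mirrors the shape of dotpr3_8bit). */
static int32_t dot3_rows(const int8_t *A, const int8_t *B0,
                         const int8_t *B1, const int8_t *B2,
                         int n, int tcols)
{
    int32_t sum = 0;
    int j;

    for (j = 0; j < n; j++) {
        sum += (int32_t)A[j] * (int32_t)B0[j];
        sum += (int32_t)A[tcols + j] * (int32_t)B1[j];
        sum += (int32_t)A[2 * tcols + j] * (int32_t)B2[j];
    }
    return sum;
}

/* Hard bit decision after subtracting the interference estimate:
   +1 when the residual is strictly positive, otherwise -1. */
static int hard_decision(int32_t y, int32_t interference)
{
    return ((y - interference) > 0) ? 1 : -1;
}
```

In this simplified model a soft value of 5 with an interference estimate of 1 yields +1, while a soft value equal to the interference estimate yields -1, matching the strict comparison against BF32_ZERO in the listing.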
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f 



-- MC Standard Algorithms -- PPC Macro language Version
File Name: GEN_R_MATRICES.MAC

Description: Float and scale R matrix values, convert to byte.

Entry/params: GEN_R_MATRICES( Rsump, Bf_scalep, Inv_scalep,
                              Scalep, No_scale_row_bfp,
                              Scale_row_bfp, Num_virt_users )

    bf_scale = *bf_scalep;
    inv_scale = *inv_scalep;

    for ( i = 0; i < num_virt_users; i++ ) {

        fsum_scale = fsum * inv_scale;
        fsum_scale *= scale;

        SATURATE( fsum_scale )
        SATURATE( fsum )

        no_scale_row_bfp[i]
        scale_row_bfp[i] = E

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date    Engineer    Reason

                    fpl         Created
                    fpl         Removed VMAXFP and added windin code
                    fpl         Removed all windin and windout


#include "salppc.inc"
#define DO_IO 1
#if DO_IO
#define SCALE_BUMP_16 16
#else
#define SCALE_BUMP_16 0
#endif

#define STORE_SCALE( vS, rA, rB ) STVX( vS, rA, rB )
#define ZERO_COND 6
RODATA_SECTION( 6 )
START_L_ARRAY( local_table )

/**
First stage for byte pack
**/
L_PERMUTE_MASK( 0x0004080c, 0x1014181c, 0x0004080c, 0x1014181c )

/**
Second stage for byte pack
**/
L_PERMUTE_MASK( 0x00010203, 0x04050607, 0x10111213, 0x14151617 )
END_ARRAY



/**
Input parameters
**/
#define Rsump r3
#define Bf_scalep r4
#define Inv_scalep r5
#define Scalep r6
#define No_scale_row_bfp r7
#define Scale_row_bfp r8
#define Num_virt_users r9

/**
Local GPRs
**/
#define indx1 r10
#define indx2 r11
#define indx3 r12
#define low4 r13
#define tptr indx2
#define low4x4 low4

/**
G4 registers
**/
#define zero v0
#define inv_scale v1
#define bf_scale v2
#define byte_pack v3
#define byte_merge v4
#define scale0 v5
#define scale1 v6
#define vtmp scale1
#define scale2 v7
#define vtmp2 scale2
#define scale3 v8
#define fsum0 v9
#define fsum1 v10
#define fsum2 v11
#define fsum3 v12
#define fsum_scale0 v13
#define fsum_scale1 v14
#define fsum_scale2 v15
#define fsum_scale3 v16
#define bsum0 v17
#define bsum1 v18
#define bsum2 v19
#define bsum3 v20
#define bsum_scale0 v21
#define bsum_scale1 v22
#define bsum_scale2 v23
#define bsum_scale3 v24
#define bvector v25
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#define bscale_vector v26 

#define rsum0 v27
#define rsum1 v28
#define rsum2 v29
#define rsum3 v30
#define seven v31

/** 

Begin code text 
**/ 

FUNC_PROLOG

ENTRY_7( gen_R_matrices, Rsump, Bf_scalep, Inv_scalep, Scalep, \
         No_scale_row_bfp, Scale_row_bfp, Num_virt_users )

CMPWI( Num_virt_users, 0 )
BGT( start )
RETURN

LABEL( start )
USE_THRU_v31( VRSAVE_COND )

/**
Load up permute vectors and loop scalers
**/
LA( tptr, local_table, 0 )
LI( indx1, 16 )
LVX( byte_pack, 0, tptr )
VSPLTISB( seven, 7 )
LVX( byte_merge, tptr, indx1 )
SCALAR_SPLAT( bf_scale, vtmp, Bf_scalep )
SCALAR_SPLAT( inv_scale, vtmp, Inv_scalep )

/**
Back up to nearest 16-byte boundary. It's okay to write before and after to
nearest 16-byte boundary in both directions.
**/
RLWINM( low4, No_scale_row_bfp, 0, 28, 31 )   /* lower 4 bits */
VXOR( zero, zero, zero )
ADD( Num_virt_users, Num_virt_users, low4 )
SUB( No_scale_row_bfp, No_scale_row_bfp, low4 )
SUB( Scale_row_bfp, Scale_row_bfp, low4 )
SLWI( low4x4, low4, 2 )
LI( indx2, 32 )
SUB( Rsump, Rsump, low4x4 )

/**
Start up loop
**/
LVX( rsum0, 0, Rsump )
LI( indx3, 48 )
LVX( rsum1, Rsump, indx1 )
SUB( Scalep, Scalep, low4x4 )
LVX( rsum2, Rsump, indx2 )
VCFSX( fsum0, rsum0, 0 )
LVX( rsum3, Rsump, indx3 )
VCFSX( fsum1, rsum1, 0 )
LVX( scale0, 0, Scalep )
VCFSX( fsum2, rsum2, 0 )
LVX( scale1, Scalep, indx1 )
VCFSX( fsum3, rsum3, 0 )
LVX( scale2, Scalep, indx2 )
VMADDFP( fsum0, fsum0, bf_scale, zero )
LVX( scale3, Scalep, indx3 )
VMADDFP( fsum1, fsum1, bf_scale, zero )
ADDIC_C( Num_virt_users, Num_virt_users, -16 )
VMADDFP( fsum2, fsum2, bf_scale, zero )
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a 



VMADDFP ( fsum3, f sum3 , bf scale, zero ) 
YMADDFP ( fsum scaleO, fsumO, inv scale, zero ) 
VMADDFP { fsum scalel, fsuml, inv scale, zero ) 
VMADDFP ( fsum scale2, f sum2 , inv_scale, zero ) 
ADDI ( Rsump, Rsump, 64 ) 

VMADDFP ( fsum_scale3, f sum3 , inv_scale, zero ) 
ADDI ( Scalep, Scalep, 64 ) 

VMADDFP { fsum scaleO, fsum scaleO, scaleO, zero ) 
VMADDFP ( fsum scalel, fsum scalel, scalel, zero ) 
VMADDFP ( fsum scale2 , fsum scale2, scale2, zero ) 
VMADDFP ( fsum scale3, fsum_scale3, scale3, zero ) 
BLE ( sixteen_sums ) 

LVX( rsum0, 0, Rsump )
LVX( rsum1, Rsump, indx1 )
VCTSXS( bsum0, fsum0, 24 )
LVX( rsum2, Rsump, indx2 )
VCTSXS( bsum1, fsum1, 24 )
VCTSXS( bsum2, fsum2, 24 )
LVX( rsum3, Rsump, indx3 )
ADDI( Rsump, Rsump, 64 )
VCTSXS( bsum3, fsum3, 24 )
LVX( scale0, 0, Scalep )
VCTSXS( bsum_scale0, fsum_scale0, 24 )
VCTSXS( bsum_scale1, fsum_scale1, 24 )
LVX( scale1, Scalep, indx1 )
VCTSXS( bsum_scale2, fsum_scale2, 24 )
LVX( scale2, Scalep, indx2 )
ADDI( No_scale_row_bfp, No_scale_row_bfp, -SCALE_BUMP_16 )
VCTSXS( bsum_scale3, fsum_scale3, 24 )
ADDI( Scale_row_bfp, Scale_row_bfp, -SCALE_BUMP_16 )
BR( mloop )

/**
  Top of loop outputs 32 bytes per trip
**/

LABEL( loop )
/* { */

STORE_SCALE( bvector, 0, No_scale_row_bfp )
VCTSXS( bsum_scale3, fsum_scale3, 24 )
STORE_SCALE( bscale_vector, 0, Scale_row_bfp )

LABEL( mloop )

LVX( scale3, Scalep, indx3 )
VCFSX( fsum0, rsum0, 0 )
VPERM( bsum0, bsum0, bsum1, byte_pack )
VCFSX( fsum1, rsum1, 0 )
VCFSX( fsum2, rsum2, 0 )
ADDI( No_scale_row_bfp, No_scale_row_bfp, SCALE_BUMP_16 )
VCFSX( fsum3, rsum3, 0 )
ADDI( Scale_row_bfp, Scale_row_bfp, SCALE_BUMP_16 )
VMADDFP( fsum0, fsum0, bf_scale, zero )
VPERM( bsum2, bsum2, bsum3, byte_pack )
VMADDFP( fsum1, fsum1, bf_scale, zero )
VMADDFP( fsum2, fsum2, bf_scale, zero )
VMADDFP( fsum3, fsum3, bf_scale, zero )
VMADDFP( fsum_scale0, fsum0, inv_scale, zero )
VPERM( bvector, bsum0, bsum2, byte_merge )
VMADDFP( fsum_scale1, fsum1, inv_scale, zero )
ADDIC_C( Num_virt_users, Num_virt_users, -16 )
VMADDFP( fsum_scale2, fsum2, inv_scale, zero )
VMADDFP( fsum_scale3, fsum3, inv_scale, zero )
ADDI( Scalep, Scalep, 64 )
VMADDFP( fsum_scale0, fsum_scale0, scale0, zero )
VPERM( bsum_scale0, bsum_scale0, bsum_scale1, byte_pack )
VMADDFP( fsum_scale1, fsum_scale1, scale1, zero )
VMADDFP( fsum_scale2, fsum_scale2, scale2, zero )
VMADDFP( fsum_scale3, fsum_scale3, scale3, zero )
VPERM( bsum_scale2, bsum_scale2, bsum_scale3, byte_pack )
VSRB( vtmp, bvector, seven )
VPERM( bscale_vector, bsum_scale0, bsum_scale2, byte_merge )
VSRB( vtmp2, bscale_vector, seven )
BLE( loop_flush )

LVX( rsum0, 0, Rsump )
VADDSBS( bvector, bvector, vtmp )
LVX( rsum1, Rsump, indx1 )
VADDSBS( bscale_vector, bscale_vector, vtmp2 )
LVX( rsum2, Rsump, indx2 )
VCTSXS( bsum0, fsum0, 24 )
LVX( rsum3, Rsump, indx3 )
VCTSXS( bsum1, fsum1, 24 )
ADDI( Rsump, Rsump, 64 )
VCTSXS( bsum2, fsum2, 24 )
LVX( scale0, 0, Scalep )
VCTSXS( bsum3, fsum3, 24 )
LVX( scale1, Scalep, indx1 )
VCTSXS( bsum_scale0, fsum_scale0, 24 )
VCTSXS( bsum_scale1, fsum_scale1, 24 )
LVX( scale2, Scalep, indx2 )
VCTSXS( bsum_scale2, fsum_scale2, 24 )

/* } */

BR( loop )

/**
  Flush loop
**/

LABEL( loop_flush )

VADDSBS( bvector, bvector, vtmp )
STORE_SCALE( bvector, 0, No_scale_row_bfp )
VADDSBS( bscale_vector, bscale_vector, vtmp2 )
STORE_SCALE( bscale_vector, 0, Scale_row_bfp )
ADDI( No_scale_row_bfp, No_scale_row_bfp, SCALE_BUMP_16 )
ADDI( Scale_row_bfp, Scale_row_bfp, SCALE_BUMP_16 )

LABEL( sixteen_sums )

VCTSXS( bsum0, fsum0, 24 )
VCTSXS( bsum1, fsum1, 24 )
VCTSXS( bsum2, fsum2, 24 )
VCTSXS( bsum3, fsum3, 24 )
VCTSXS( bsum_scale0, fsum_scale0, 24 )
VPERM( bsum0, bsum0, bsum1, byte_pack )
VCTSXS( bsum_scale1, fsum_scale1, 24 )
VPERM( bsum2, bsum2, bsum3, byte_pack )
VCTSXS( bsum_scale2, fsum_scale2, 24 )
VPERM( bvector, bsum0, bsum2, byte_merge )
VCTSXS( bsum_scale3, fsum_scale3, 24 )
VPERM( bsum_scale0, bsum_scale0, bsum_scale1, byte_pack )
VPERM( bsum_scale2, bsum_scale2, bsum_scale3, byte_pack )
VSRB( vtmp, bvector, seven )
VPERM( bscale_vector, bsum_scale0, bsum_scale2, byte_merge )
VADDSBS( bvector, bvector, vtmp )
VSRB( vtmp, bscale_vector, seven )
STORE_SCALE( bvector, 0, No_scale_row_bfp )
VADDSBS( bscale_vector, bscale_vector, vtmp )
STORE_SCALE( bscale_vector, 0, Scale_row_bfp )

/**
  Return
**/

LABEL( ret )

FREE_THRU_v31( VRSAVE_COND )
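
The per-element arithmetic of the loop above can be hard to follow through the interleaved vector instructions. The following scalar C sketch traces one lane as this listing appears to read: the 32-bit sum is converted to float, multiplied by bf_scale, converted back to a saturated 32-bit integer with a 2^24 scale, reduced to its top byte, and then incremented by its own sign bit. The function name sum_to_bf8 is hypothetical, and the sketch assumes that the byte_pack and byte_merge permutes keep the most significant byte of each word; the permute table for this routine is not shown here, so that is an assumption rather than a confirmed detail.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane of the loop (not the author's code).
 * Assumes byte_pack/byte_merge select the top byte of each 32-bit word. */
static int8_t sum_to_bf8(int32_t rsum, float bf_scale)
{
    /* VCFSX followed by VMADDFP with a zero addend */
    float fsum = (float)rsum * bf_scale;

    /* VCTSXS with scale 24: multiply by 2^24, saturate, truncate toward zero */
    double x = (double)fsum * 16777216.0;
    int32_t q;
    if (x >= 2147483647.0)
        q = INT32_MAX;
    else if (x <= -2147483648.0)
        q = INT32_MIN;
    else
        q = (int32_t)x;

    /* Keep the most significant byte and reinterpret it as signed */
    int hi = (int)(((uint32_t)q >> 24) & 0xFFu);
    if (hi >= 128)
        hi -= 256;

    /* VSRB by 7 isolates the sign bit; VADDSBS adds it back with saturation */
    int out = hi + ((hi < 0) ? 1 : 0);
    if (out > 127)
        out = 127;
    return (int8_t)out;
}
```

Under these assumptions a scaled sum of 3.75 becomes 3 and a scaled sum of -3.75 becomes -3, while values outside the 8-bit range clamp to 127 and -127.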
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Majority Voter Control Logic 

Description: This Module serves as a generic majority voter 

Author : Steven Imperiali/Mirza Cifric 

Date : 5-15-2000 



LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;
use ieee.std_logic_arith.all;
use ieee.std_logic_unsigned.all;
USE STD.TEXTIO.ALL;

ENTITY m_voter IS
PORT(
    clk_66_pal6 : IN  std_logic;
    reset_0     : IN  std_logic;
    request0_0  : IN  std_logic;
    request1_0  : IN  std_logic;
    request2_0  : IN  std_logic;
    request3_0  : IN  std_logic;
    request4_0  : IN  std_logic;
    healthy0_1  : IN  std_logic;
    healthy1_1  : IN  std_logic;
    healthy2_1  : IN  std_logic;
    healthy3_1  : IN  std_logic;
    healthy4_1  : IN  std_logic;
    voteout_0   : OUT std_logic);
END m_voter;

ARCHITECTURE voter OF m_voter IS

signal pro: STD_LOGIC_VECTOR (3 downto 0);
signal against: STD_LOGIC_VECTOR (3 downto 0);
signal result: STD_LOGIC;

BEGIN


check_result: process (request0_0, request1_0, request2_0, request3_0, request4_0,
                       healthy0_1, healthy1_1, healthy2_1, healthy3_1, healthy4_1)
    variable pro: STD_LOGIC_VECTOR (3 downto 0);
    variable against: STD_LOGIC_VECTOR (3 downto 0);
    variable solution: STD_LOGIC;
begin
    pro := "0000";      -- set number of pro voters
    against := "0000";  -- set number of against voters

    -- Get the number of pros
    if (healthy0_1 = '1' and request0_0 = '1') then
        pro := pro + "0001";
    end if;
    if (healthy1_1 = '1' and request1_0 = '1') then
        pro := pro + "0001";
    end if;
    if (healthy2_1 = '1' and request2_0 = '1') then
        pro := pro + "0001";
    end if;
    if (healthy3_1 = '1' and request3_0 = '1') then
        pro := pro + "0001";
    end if;
    if (healthy4_1 = '1' and request4_0 = '1') then
        pro := pro + "0001";
    end if;

    -- Get the number of cons
    if (healthy0_1 = '1' and request0_0 = '0') then
        against := against + "0001";
    end if;
    if (healthy1_1 = '1' and request1_0 = '0') then
        against := against + "0001";
    end if;
    if (healthy2_1 = '1' and request2_0 = '0') then
        against := against + "0001";
    end if;
    if (healthy3_1 = '1' and request3_0 = '0') then
        against := against + "0001";
    end if;
    if (healthy4_1 = '1' and request4_0 = '0') then
        against := against + "0001";
    end if;

    -- final score
    if (pro = "0001" and against < "0001") then
        solution := '1';
    elsif (pro = "0010" and against < "0010") then
        solution := '1';
    elsif (pro = "0011" and against < "0011") then
        solution := '1';
    elsif (pro = "0100" and against < "0011") then
        solution := '1';
    elsif (pro = "0101" and against < "0011") then
        solution := '1';
    else
        solution := '0';
    end if;

    result <= solution;     -- put variable val into signal val
    voteout_0 <= solution;  -- put variable val into signal val

end process check_result;



result_latch: process (reset_0, clk_66_pal6)
begin
    IF (reset_0 = '0') THEN
        voteout_0 <= '1';
    ELSIF rising_edge(clk_66_pal6) THEN
        IF result = '0' THEN
            voteout_0 <= '0';
        END IF;
    END IF;
END PROCESS;

END voter;
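
For readers more comfortable with software than VHDL, the combinational vote in the check_result process can be modelled in a few lines of C. This is only a sketch: the function name m_voter_solution is hypothetical, the inputs are plain integers rather than std_logic, and the final "else" result of 0 is an assumption because that branch is truncated in the listing.

```c
/* Hypothetical software model of the check_result vote (not the VHDL itself).
 * request[i] and healthy[i] are 0 or 1 for each of the five voters. */
static int m_voter_solution(const int request[5], const int healthy[5])
{
    int pro = 0;      /* healthy voters asserting the request */
    int against = 0;  /* healthy voters not asserting it */

    for (int i = 0; i < 5; i++) {
        if (healthy[i] && request[i])
            pro++;
        if (healthy[i] && !request[i])
            against++;
    }

    /* Thresholds copied from the "final score" chain in the listing */
    switch (pro) {
    case 1:
        return against < 1;
    case 2:
        return against < 2;
    case 3:
    case 4:
    case 5:
        return against < 3;
    default:
        return 0;  /* assumed: the listing's else branch is truncated */
    }
}
```

Unhealthy voters are ignored entirely, so a single healthy voter asserting the request carries the vote when every other voter is marked unhealthy.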



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* FILENAME : mudlib.h 

* CC NUMBER: 

* ABSTRACT: 

* USAGE: 

* COMMENTS: 

* AUTHOR: M. Vinskus 

* DATE: 18-JUL-2000

**/

/* @MERCURY.COPYRIGHT.H@ */
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#define BIAS_8BIT 1 



(((x) >= 0) ? (x) : (-(x))) 
(((f) >= 0.0) ? (f) : (-(f))) 



* TYPE DEFINITIONS 
**/ 

typedef long BF32;
typedef short BF16;
typedef char BF8;

typedef struct {
    BF8 real;
    BF8 imag;
} COMPLEX_BF8;

typedef struct {
    BF16 real;
    BF16 imag;
} COMPLEX_BF16;

typedef struct {
    BF32 real;
    BF32 imag;
} COMPLEX_BF32;

/**********************************************************************
 * MACRO DEFINITIONS
 **********************************************************************
 **/

/* assumes (-(2.0 ^ 7) - 0.5) < (bf_factor * s) < ((2.0 ^ 7) - 0.5) */

#define SFtoBF8( bf_factor, s ) \
    ((BF8)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF8( bf_factor, v, bfv, n ) \
{ \
    int i; \
    float factor = bf_factor; \
    vsmulx( v, 1, &factor, v, 1, n, 0 ); \
    for ( i = 0; i < n; i++ ) \
        bfv[i] = (v[i] > 0.0) ? (BF8)(v[i] + 0.5) : (BF8)(v[i] - 0.5); \
}

#define SBF8toF( bf_rfactor, bfs ) \
    ((bf_rfactor) * (float)(bfs))

#define VBF8toF( bf_rfactor, bfv, v, n ) \
{ \
    int i; \
    float rfactor = bf_rfactor; \
    for ( i = 0; i < n; i++ ) \
        v[i] = (float)bfv[i]; \
    vsmulx( v, 1, &rfactor, v, 1, n, 0 ); \
}

/* assumes (-(2.0 ^ 15) - 0.5) < (bf_factor * s) < ((2.0 ^ 15) - 0.5) */

#define SFtoBF16( bf_factor, s ) \
    ((BF16)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))
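
The scalar conversion macros above scale a float by a block-floating-point factor and round to the nearest integer by adding or subtracting 0.5 before the truncating cast. A minimal function-style sketch of the 8-bit pair follows. The names sf_to_bf8 and sbf8_to_f are hypothetical stand-ins for SFtoBF8 and SBF8toF, and BF8 is declared as signed char here (the listing uses plain char, whose signedness is implementation-defined).

```c
/* Assumed: signed 8-bit storage; the listing declares BF8 as plain char. */
typedef signed char BF8;

/* Float to 8-bit block floating point: scale, then round half away from zero.
 * The caller must keep bf_factor * s inside the 8-bit range, as the
 * "assumes" comment in the listing requires. */
static BF8 sf_to_bf8(float bf_factor, float s)
{
    return (BF8)(bf_factor * s + ((s > 0.0f) ? 0.5f : -0.5f));
}

/* 8-bit block floating point back to float using the reciprocal factor. */
static float sbf8_to_f(float bf_rfactor, BF8 bfs)
{
    return bf_rfactor * (float)bfs;
}
```

With a factor of 64 and a reciprocal factor of 1/64, the value 0.5 maps to 32 and back to 0.5 exactly; values that are not multiples of 1/64 come back with a quantization error of at most 1/128.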

#define VFtoBF16( bf_factor, v, bfv, n ) \
{ \
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float factor = bf factor; \ 

vsmulx ( (float *)v, 1, &factor, (float ' 

vfixrx ( (float *)v, 1, (BF16 *)bfv, 1, i 



#define VBF16toP( bf rf actor, bfv, 
{ \ 

float rfactor = bf rfactor; \ 
vfltx ( (short *)bfv, 1, v, 1, 
vsmulx ( v, 1, &rfactor, v, 1, 

} 



/* assumes (-(2.0 ^ 31) - 0.5) < (bf_factor * s) < ((2.0 ^ 31) - 0.5) */

#define SFtoBF32( bf_factor, s ) \
    ((BF32)((bf_factor) * (s) + (((s) > 0.0) ? 0.5 : -0.5)))

#define VFtoBF32( bf_factor, v, bfv, n ) \
{ \

float factor = bf factor; \ 

vsmulx ( v, 1, ^factor, (float *)bfv, 1, n, 0 
vfixr32x ( (float *)bfv, 1, (int *)bfv, 1, n, 

} 



#define VBF32toF( bf rfactor, bfv, 
{ \ 

float rfactor = bf rfactor; \ 
vflt32x ( (int *)bfv, 1, v, 1, 
vsmulx ( v, 1, &rf actor, v, 1, 



bfv, ((n)«l) 



) 

#define BHAT_SFtoBF( s )
#define BHAT_SBFtoF( bfs )
#define BHAT_VFtoBF( v, bfv, n ) \
{ \
    float bias = (float)BIAS_8BIT; \
    vsaddx( v, 1, &bias, v, 1, n, 0 ); \
    fixpixax( v, 1, bfv, n, 0 ); \
}

#define BHAT_VBFtoF( bfv, v, n ) \
{ \
    float bias = (float)(-BIAS_8BIT); \
    fltpixax( bfv, v, 1, n, 0 ); \
    vsaddx( v, 1, &bias, v, 1, n, 0 ); \
}


#define RHAT_SFtoBF( s )          SFtoBF8( BF_RY_FACTOR, s )
#define RHAT_SBFtoF( bfs )        SBF8toF( BF_RY_RFACTOR, bfs )
#define RHAT_VFtoBF( v, bfv, n )  VFtoBF8( BF_RY_FACTOR, v, bfv, n )
#define RHAT_VBFtoF( bfv, v, n )  VBF8toF( BF_RY_RFACTOR, bfv, v, n )

#define Y_SFtoBF( s )             SFtoBF32( BF_RY_FACTOR, s )
#define Y_SBFtoF( bfs )           SBF32toF( BF_RY_RFACTOR, bfs )
#define Y_VFtoBF( v, bfv, n )     VFtoBF32( BF_RY_FACTOR, v, bfv, n )
#define Y_VBFtoF( bfv, v, n )     VBF32toF( BF_RY_RFACTOR, bfv, v, n )



#define MUDLIB_DECR_VIRT_USER( ptov_map, phys_user, virt_user ) \
{ \
    --virt_user; \
    if ( virt_user < 0 ) { \
        --phys_user; \
        virt_user = ptov_map[phys_user] - 1; \
    } \
}

#define MUDLIB_INCR_VIRT_USER( ptov_map, phys_user, virt_user ) \
{ \
    ++virt_user; \
    if ( virt_user == ptov_map[phys_user] ) { \
        ++phys_user; \
        virt_user = 0; \
    } \
}
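
The two stepping macros above walk a (physical user, virtual user) pair forward or backward through ptov_map, where ptov_map[p] holds the number of virtual users belonging to physical user p. The sketch below restates them as C functions so the wrap-around behaviour is easier to see. The function names are hypothetical, and neither function checks for running off either end of the map.

```c
/* Step to the next virtual user, rolling over to the next physical user
 * when the current one is exhausted (mirrors MUDLIB_INCR_VIRT_USER). */
static void incr_virt_user(const unsigned char *ptov_map,
                           int *phys_user, int *virt_user)
{
    ++*virt_user;
    if (*virt_user == ptov_map[*phys_user]) {
        ++*phys_user;
        *virt_user = 0;
    }
}

/* Step to the previous virtual user, rolling back to the last virtual user
 * of the previous physical user (mirrors MUDLIB_DECR_VIRT_USER). */
static void decr_virt_user(const unsigned char *ptov_map,
                           int *phys_user, int *virt_user)
{
    --*virt_user;
    if (*virt_user < 0) {
        --*phys_user;
        *virt_user = ptov_map[*phys_user] - 1;
    }
}
```

For a map of {2, 3, 1}, stepping forward from (0, 0) visits (0, 1), (1, 0), (1, 1), (1, 2) and (2, 0) in that order.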



* PUBLIC FUNCTION PROTOTYPES 
**/ 

int mudlib get CorrO offset ( 

unsigned char *ptov_map, 
int num fingers, 
int tot_virt_users. 



/* no more than 256 - 
/* typically, 4 */ 
/* sum of ptovjnap o- 



all phys users 



/* zero-based index into ptov map */ 
/* must be < ptov_map [start_phys_user] 



) ? 

int mudlib get CorrO size ( 

unsigned char *ptov_map, 
int num fingers, 
int tot_virt_ 
*/ 
int 
int 
*/ 
int 
int 



/* no more than 256 virts. per phys */ 
/* typically, 4 */ 

/* sum of ptovjnap over all phys users 

/* zero-based index into ptov map */ 
/* must be < ptovjnap [start_phys_user] 

/* zero-based index into ptov map */ 
/* must be < ptovjnap [end_jphys_user] */ 



int mudlib get Corrl offset ( 

unsigned char *ptov_map, 
int 
int 
*/ 
int 
int 
*/ 



/* no more than 256 virts. per phys */ 

/* typically, 4 */ 

/* sum of ptovjnap over all phys users 

/* zero-based index into ptov map */ 

/* must be < ptov_map [start_phys_user] 



int mudlib get Corrl size ( 

unsigned char *ptov_map, 
int 
int 
*/ 
int 
int 
*/ 
int 
int 



/* no more than 256 virts. per phys */ 
/* typically, 4 */ 

/* sum of ptov_map over all phys users 

/* zero-based index into ptov map */ 
/* must be < ptovjnap [start_phys_user] 

/* zero-based index into ptov map */ 
/* must be < ptov_map [end_phys_user] */ 



Page No. 264 



EV 093 931 868 US 



int mudlib get RO offset ( 

unsigned char *ptov_map, 
int tot virt users, 



/* no more than 256 virts. per phys */ 

/* sum of ptov_map over all phys users 

/* zero-based index into ptov map */ 

/* must be < ptov_map [start_phys_user] 



int mudlib get RO size { 

unsigned char *ptov__map, 
int tot_virt_users, 
*/ 
int 
int 

*/ 

int 
int 



/* no more than 256 virts. per phys */ 

/* sum of ptov_map over all phys users 

/* zero-based index into ptov map */ 

/* must be < ptov_map [start_phys_user] 

/* zero-based index into ptov map */ 

/* must be < ptov_map [end_phys_user] */ 



int mudlib get Rl offset ( 

unsigned char *ptov 
int tot_virt_users," 



a 
m 



/* no more than 256 virts. per phys */ 

/* sum of ptov_map over all phys users 

/* zero-based index into ptov map */ 

/* must be < ptov_map [start_phys_user] 



mudlib get Rl size ( 

unsigned char *ptov_map / 
int tot_virt_users, 
*/ 
int 
int 
*/ 
int 
int 



) ; 



mudlib get num_virt_users ( 

unsigned char *ptov map, 
int start phys user, 
int start_virt_user, 
*/ 



no more than 256 virts. per phys */ 
sum of ptov_map over all phys users 



/* zero-based index into ptov map */ 

/* must be < ptov_map [start_phys_user] 

/* zero-based index into ptov map */ 

/* must be < ptov_map [end_phys_user] */ 



no more than 256 virts. per phys */ 
zero-based index into ptov map */ 
/* must be < ptov_map [start_phys_user] 

/* zero-based index into ptov map */ 
/* must be < ptov_map [end_phys_user] */ 



void mudlib_get_end_user_pair(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int num_virt_users,        /* number from start (must be > 0) */
    int *end_phys_user,        /* zero-based index into ptov_map */
    int *end_virt_user         /* will be < ptov_map[*end_phys_user] */
);



void mudlib gen R ( 

COMPLEX BF16 *mpathl bf, 
COMPLEX BF16 *mpath2 bf , 

C0MPLEX_BF8 *corr_0_bf. 



/* adjusted for starting physical user 
/* adjusted for starting physical user 
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unsigned char *ptov_map, 
float *bf scalep, 
*inv_scalep, 



float 
*/ 

float 
char 
BF8 



int 

int 

int 

int 

int 

*/ 

int 



*scalep, 
*L1 cachep, 
*R0 upper bf, 
*R0 lower bf, 
*R1 trans_bf, 
*Rlm bf, 
tot phys users, 
tot virt users, 
start phys user, 
start virt user, 
end_phy s_us er , 

end virt_user 



/* no more than 256 virts. per phys */ 

/* scalar: always a power of 2 */ 

/* adjusted for starting physical user 

/* start at 0 ' th physical user */ 

/* must be 32-byte aligned */ 



/* zero-based ("starting row") */ 

/* relative to start phys user */ 

/* actual number of "rows" to process 

/* relative to end_phys_user */ 



void mudlib_4R_to_3R(
    BF8 *R0_upper_bf,          /* input matrix */
    BF8 *R0_lower_bf,          /* input matrix */
    BF8 *R1_trans_bf,          /* input matrix */
    char *L1_cachep,           /* 32K-byte temp, 32-byte aligned */
    BF8 *R0_bf,                /* output matrix */
    BF8 *R1_bf,                /* output matrix */
    int tot_virt_users
);

void mudlib_mpic(
    BF8 *Bt_hat,
    BF8 *R0_hat,
    BF8 *R1_hat,
    BF8 *R1m_hat,
    BF32 *Y,
    BF32 Ythresh,
    int N_users,
    int N_bits,
    int N_stages
);

void mudlib_reformat_corr(
    COMPLEX *in_corr,
    COMPLEX_BF8 *corr_0_bf,
    COMPLEX_BF8 *corr_1_bf,
    int num_virt_users,
    int num_multipath );



int I, COMPLEX SPLIT *B, 
int N, int X ) ; 



temp 



(_v) 



mudlib get CorrO offset v ( 

unsigned char *ptov_map, 
int num fingers, 
int tot_virt_users , 
*/ 



/* no more than 256 virts. per phys */ 

/* typically, 4 */ 

/* sum of ptov_map over all phys users 

/* zero-based index into ptov map */ 

/* must be < ptovjnap [start_jphys_user] 



mudlib get Corrl offset v ( 

unsigned char *ptov_map, 
int num fingers, 
int tot_virt_users, 
*/ 

int start j>hys_user, 



no more than 256 virts. per pi 

typically, 4 */ 

sum of ptov_map over all phys 



/* zero-based index into ptov_map */ 
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y mudlib.h 



2/23/2001 
-_map [startjhys_user] 



mudlib get R0 offset_v ( 

unsigned char *ptov 
int tot virt_users, 



/* no more than 256 virts. per phys */ 
/* sum of ptov_map over all phys users 

/* zero-based index into ptov map */ 
/* must be < ptov_map [start_phys_user] 



mudlib get R0 size v ( 

unsigned char *ptov__map, 
int tot_virt_users, 



*/ 

int 

int 

*/ 

int 

int 



start phys user, 
start_virt_user , 



no more than 256 virts. per phys */ 
sum of ptovjmap over all phys users 

/* zero-based index into ptov map */ 
/* must be < ptov_map [start_phys_user] 

/* zero-based index into ptov map */ 
/* must be < ptov_map [end_phys_user] */ 



int mudlib get Rl offset_v ( 

unsigned char *ptov_map, 

int tot_virt_users, 

*/ 

int start phys user, 
int start_virt_user 



/* no more than 256 virts. per phys */ 
/* sum of ptov_map over all phys users 

/* zero-based index into ptov map */ 
/* must be < ptovjmap [start jphys_user] 



int mudlib_get_R1_size_v(
    unsigned char *ptov_map,   /* no more than 256 virts. per phys */
    int tot_virt_users,        /* sum of ptov_map over all phys users */
    int start_phys_user,       /* zero-based index into ptov_map */
    int start_virt_user,       /* must be < ptov_map[start_phys_user] */
    int end_phys_user,         /* zero-based index into ptov_map */
    int end_virt_user          /* must be < ptov_map[end_phys_user] */
);



#endif /* _MUDLIB_H */ 
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#include "mudlib.h" 



#define INDEX_5D_TO_LIN( a0, a1, a2, a3, a4, max_a1, max_a2, max_a3, max_a4 ) \
    ((a4) + (max_a4) * ((a3) + (max_a3) * ((a2) + (max_a2) * ((a1) \
    + (max_a1) * (a0)))))

void mudlib_reformat_corr(
    COMPLEX *in_corr,
    COMPLEX_BF8 *corr_0_bf,
    COMPLEX_BF8 *corr_1_bf,
    int num_virt_users,
    int num_fingers )
{
    int i, j, q, q1;

    for ( i = 0; i < num_virt_users; i++ ) {
        for ( j = (i+1); j < num_virt_users; j++ ) {
            for ( q = 0; q < num_fingers; q++ ) {
                for ( q1 = 0; q1 < num_fingers; q1++ ) {
                    corr_0_bf->real = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          0, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].real );
                    corr_0_bf->imag = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          0, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].imag );
                    ++corr_0_bf;
                }
            }
        }
    }

    for ( i = 0; i < num_virt_users; i++ ) {
        for ( j = 0; j < num_virt_users; j++ ) {
            for ( q = 0; q < num_fingers; q++ ) {
                for ( q1 = 0; q1 < num_fingers; q1++ ) {
                    corr_1_bf->real = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          1, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].real );
                    corr_1_bf->imag = CORR_SFtoBF( in_corr[INDEX_5D_TO_LIN(
                                          1, i, j, q1, q,
                                          num_virt_users,
                                          num_virt_users,
                                          num_fingers,
                                          num_fingers)].imag );
                    ++corr_1_bf;
                }
            }
        }
    }
}
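
The INDEX_5D_TO_LIN macro used by mudlib_reformat_corr flattens a five-dimensional index into a row-major linear offset, with a4 varying fastest. A small function-style sketch of the same arithmetic is shown below; the name index_5d_to_lin is hypothetical. As in the macro, the extent of the outermost dimension a0 is not needed.

```c
#include <stddef.h>

/* Row-major flattening of (a0, a1, a2, a3, a4); a4 is the fastest-varying
 * index. max_a1..max_a4 are the extents of dimensions 1 through 4. */
static size_t index_5d_to_lin(size_t a0, size_t a1, size_t a2,
                              size_t a3, size_t a4,
                              size_t max_a1, size_t max_a2,
                              size_t max_a3, size_t max_a4)
{
    return a4 + max_a4 * (a3 + max_a3 * (a2 + max_a2 * (a1 + max_a1 * a0)));
}
```

With extents (3, 3, 4, 4), a unit step in a4, a3, a2, a1 and a0 moves the linear offset by 1, 4, 16, 48 and 144 elements respectively.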
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#include "mudlib.h" 



void mtrans32_8bit(
    BF8 *A,            /* logically contiguous input 32 x 32 blocks */
    BF8 *C,            /* output blocks separated by 32 * out_tc elements */
    char *L1_cachep,
    int A_ncols,
    int A_nrows,
    int C_tcols
);

void mtriangle_8bit(
    BF8 *A,
    BF8 *C,
    int N
);

void mudlib_4R_to_3R(
    BF8 *R0_upper_bf,  /* input matrix */
    BF8 *R0_lower_bf,  /* input matrix */
    BF8 *R1_trans_bf,  /* input matrix */
    char *L1_cachep,   /* temp: 32K bytes, 32-byte aligned */
    BF8 *R0_bf,        /* output matrix */
    BF8 *R1_bf,        /* output matrix */
    int tot_virt_users
)
{
    BF8 *R0_work;
    int i, nrows, R0_tcols, tcols;

    tcols = (tot_virt_users + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;

    nrows = R_MATRIX_ALIGN;
    for ( i = tot_virt_users; i > 0; i -= R_MATRIX_ALIGN ) {
        if ( nrows > i ) nrows = i;
        mtrans32_8bit( R1_trans_bf, R1_bf, L1_cachep, tot_virt_users,
                       nrows, tcols );
        R1_trans_bf += (tcols << R_MATRIX_ALIGN_LOG);
        R1_bf += R_MATRIX_ALIGN;
    }

    R0_work = R0_bf;
    R0_tcols = tcols;
    nrows = R_MATRIX_ALIGN;
    for ( i = tot_virt_users; i > 0; i -= R_MATRIX_ALIGN ) {
        if ( nrows > i ) nrows = i;
        mtrans32_8bit( R0_lower_bf, R0_work, L1_cachep, i, nrows, tcols );
        R0_lower_bf += (R0_tcols << R_MATRIX_ALIGN_LOG);
        R0_work += ((tcols << R_MATRIX_ALIGN_LOG) + R_MATRIX_ALIGN);
        R0_tcols -= R_MATRIX_ALIGN;
    }

    mtriangle_8bit( R0_upper_bf, R0_bf, tot_virt_users );
}
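
The padded column count in mudlib_4R_to_3R comes from the familiar "add the mask, then clear the low bits" idiom, which rounds a count up to the next multiple of a power-of-two tile size. The sketch below isolates that idiom. The value 5 for R_MATRIX_ALIGN_LOG is an assumption chosen to match the 32 x 32 tiles mentioned in the comments, since the listing's own definition is not legible, and the helper names are hypothetical.

```c
#define R_MATRIX_ALIGN_LOG   5   /* assumed: 32-element tiles */
#define R_MATRIX_ALIGN       (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK  (R_MATRIX_ALIGN - 1)

/* Round n up to the next multiple of the tile size (n must be >= 0). */
static int round_up_to_tile(int n)
{
    return (n + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
}

/* Number of tiles needed to cover n elements. */
static int tiles_needed(int n)
{
    return round_up_to_tile(n) >> R_MATRIX_ALIGN_LOG;
}
```

Under that assumption 100 virtual users are padded to 128 columns, which is four tiles wide.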

#if COMPILE_C

void mtrans32_8bit(
    BF8 *A,            /* logically contiguous input A_nrows x A_ncols blocks */
    BF8 *C,            /* output blocks separated by 32 * C_tcols elements */
    char *L1_cachep,
    int A_ncols,
    int A_nrows,
    int C_tcols
)
{
    BF8 *Ap, *Cp;
    int A_tcols, C_nrows;
    int i, j;

    (void)L1_cachep;

    A_tcols = (A_ncols + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    C_nrows = R_MATRIX_ALIGN;

    while ( A_ncols ) {
        if ( A_ncols < C_nrows ) C_nrows = A_ncols;
        Ap = A;
        Cp = C;
        for ( i = 0; i < A_nrows; i++ ) {
            for ( j = 0; j < C_nrows; j++ )
                Cp[j * C_tcols] = Ap[j];
            Ap += A_tcols;
            Cp += 1;
        }
        A += R_MATRIX_ALIGN;                   /* input travels horizontally */
        C += (C_tcols << R_MATRIX_ALIGN_LOG);  /* output travels vertically */
        A_ncols -= C_nrows;
    }
}

void mtriangle_8bit(
    BF8 *A,
    BF8 *C,
    int N
)
{
    int A_counter, A_tcols, altivec_N, C_tcols;
    int i, j;

    A_counter = (N + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
    C_tcols = A_counter + 1;
    altivec_N = (N + ALTIVEC_ALIGN_MASK) & ~ALTIVEC_ALIGN_MASK;

    for ( i = 0; i < N; i++ ) {
        for ( j = 0; j < altivec_N; j++ )
            C[j] = A[j];
        --altivec_N;
        A_tcols = (A_counter + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
        A += (A_tcols + 1);
        C += C_tcols;
    }
}

#endif /* COMPILE_C */
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mtrans32_8bit .mac 



MC Standard Algorithms -- PPC Macro Language Version

File Name:   mtrans32_8bit.mac
Description: Perform N_tiles 32 x 32 byte transposes
             contiguous input 32 x 32 blocks
             output blocks separated by 32 * out_tc elements

    void mtrans32_8bit (
        BF8 *A,
        BF8 *C,
        char *L1_cache,
        int A_ncols,
        int A_nrows,
        int C_tcols
    )
    {
        A_tcols = (A_ncols + R_MATRIX_ALIGN_MASK) & ~R_MATRIX_ALIGN_MASK;
        C_nrows = R_MATRIX_ALIGN;

        while ( A_ncols ) {
            if ( A_ncols < C_nrows ) C_nrows = A_ncols;
            Ap = A;
            Cp = C;
            for ( i = 0; i < A_nrows; i++ ) {
                for ( j = 0; j < C_nrows; j++ )
                    Cp[j * C_tcols] = Ap[j];
                Ap += A_tcols;
                Cp += 1;
            }
            A += R_MATRIX_ALIGN;
            C += (C_tcols << R_MATRIX_ALIGN_LOG);
            A_ncols -= C_nrows;
        }
    }

Restrictions: A, C and L1_cache must all be 16-byte aligned.
              C_tcols must be a multiple of 16.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision    Date      Engineer    Reason
0.0         000913    fpl         Created

#include "salppc.inc"

#define DO_PREFETCH 1

#define LOAD_INPUT( vT, rA, rB )    LVXL( vT, rA, rB )
#define LOAD_CACHE( vT, rA, rB )    LVX( vT, rA, rB )

#define STORE_CACHE( vS, rA, rB )   STVX( vS, rA, rB )
#define STORE_OUTPUT( vS, rA, rB )  STVX( vS, rA, rB )

#define R_MATRIX_ALIGN_LOG
#define R_MATRIX_ALIGN              (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK         (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG
#define ALTIVEC_ALIGN               (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK          (ALTIVEC_ALIGN - 1)






#if DO_PREFETCH
#define PREFETCH( rA, rB, STRM, DST_BUMP ) \
    DSTT( rA, rB, STRM ) \
    ADD( rA, rA, DST_BUMP )
#else
#define PREFETCH( rA, rB, STRM, DST_BUMP )
#endif

/**
  Four permute vectors for output stage
**/

RODATA_SECTION( 5 )
START_L_ARRAY( local_table )
L_PERMUTE_MASK( 0x00010405, 0x08090c0d, 0x10111415, 0x18191c1d )
L_PERMUTE_MASK( 0x02030607, 0x0a0b0e0f, 0x12131617, 0x1a1b1e1f )
L_PERMUTE_MASK( 0x00020406, 0x080a0c0e, 0x10121416, 0x181a1c1e )
L_PERMUTE_MASK( 0x01030507, 0x090b0d0f, 0x11131517, 0x191b1d1f )
END_ARRAY
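
Each row of the permute table above is a 16-byte selector for the AltiVec vperm instruction: every selector byte picks one byte out of the 32-byte concatenation of two source vectors, using its low five bits as the index. The scalar C sketch below emulates that behaviour so the effect of a mask can be checked by hand. The helper names expand_mask and vperm are hypothetical, and the sketch assumes the four 32-bit words of each mask are laid out most significant byte first, as on the big-endian PowerPC targets this listing was written for.

```c
#include <stdint.h>
#include <string.h>

/* Expand four 32-bit mask words into 16 selector bytes, big-endian order. */
static void expand_mask(const uint32_t words[4], uint8_t mask[16])
{
    for (int w = 0; w < 4; w++)
        for (int b = 0; b < 4; b++)
            mask[4 * w + b] = (uint8_t)(words[w] >> (24 - 8 * b));
}

/* Scalar emulation of vperm: selector values 0..15 index a, 16..31 index b. */
static void vperm(uint8_t out[16], const uint8_t a[16], const uint8_t b[16],
                  const uint8_t mask[16])
{
    uint8_t tmp[16];  /* temporary so out may alias a or b */
    for (int i = 0; i < 16; i++) {
        unsigned sel = mask[i] & 0x1Fu;
        tmp[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
    memcpy(out, tmp, 16);
}
```

Read this way, the first mask gathers byte pairs 0-1, 4-5, 8-9 and so on from the two inputs (the upper halfword of every word), and the third mask gathers the even-numbered bytes.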



Input parameters 
**/ 

#define A 
#define C 
#define Ll_cache 
#define NC 
#define NR 
#define TCC 

ttdefine NC left 
ttdefine TCA 
#define TCA4 
#define icount 

#define aptrO 
ttdefine aptrl 
#define aptr2 
#define aptr3 

#define aindxO 

#define aindxl 

ttdefine aindx2 

ttdefine aindx3 

ttdefine cptrO 
ttdefine cptrl 
ttdefine cptr2 
ttdefine cptr3 

ttdefine cindxO 
ttdefine cindxl 
ttdefine cindx2 
ttdefine cindx3 

ttdefine cindx4 
ttdefine cindx5 
ttdefine cindx6 
ttdefine cindx7 



rl2 
rl3 
rl4 
rl5 



r2 0 
r21 
r22 
r23 

r24 
r25 
r26 
r27 



aindxO 
aindxl 
aindx2 
aindx3 



2 
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mt Vans 3 2 


#define out_indx0   aptr0
#define out_indx1   aptr1
#define out_indx2   aptr2
#define out_indx3   aptr3
#define cptr        cptr0
#define outptr0     cptr1
#define outptr1     cptr2
#define TCC4        cptr3
#define tptr        icount
#define temp        aptr3
#define Cbump       r0
#define dstp
#define dst_code    r28

/**
  G4 registers
**/

#define a00         v0
#define a01         v1
#define a02         v2
#define a03         v3
#define a10         v4
#define a11         v5
#define a12         v6
#define a13         v7
#define a20         v8
#define a21         v9
#define a22         v10
#define a23         v11
#define a30         v12
#define a31         v13
#define a32         v14
#define a33         v15
#define c00         v16
#define c01         v17
#define c02         v18
#define c03         v19
#define c10         v20
#define c11         v21
#define c12         v22
#define c13         v23
#define c20         c00
#define c21         c01
#define c22         c02
#define c23         c03
#define c30         c10
#define c31         c11
#define c32         c12
#define c33         c13
#define vt0         v24
#define vt1         v25
#define vt2         v26
#define vt3         v27
#define vt4         c00
#define vt5         c01
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Page No. 273 



#define vt6         c02
#define vt7         c03
#define vp0         v28
#define vp1         v29
#define vp2         v30
#define vp3         v31
#define c0          a00
#define c1          a01
#define c2          a02
#define c3          a03
#define c4          a10
#define c5          a11
#define c6          a12
#define c7          a13
#define out0        a20
#define out1        a21
#define out2        a22
#define out3        a23
#define out4        a30
#define out5        a31
#define out6        a32
#define out7        a33

/**
  Text begins
**/

FUNC_PROLOG

ENTRY_5( mtrans32_8bit, A, C, L1_cache,

ADDI( TCA, NC, R_MATRIX_ALIGN_MASK )
CMPWI( NC_left, 32 )
RLWINM( TCA, TCA, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
LA( tptr, local_table, 0 )
MAKE_STREAM_CODE_IIR( dst_code, 64, 4, TCA )
LVX( vp0, 0, tptr )
ADDI( tptr, tptr, 16 )
LVX( vp1, 0, tptr )
ADDI( tptr, tptr, 16 )
XORI( temp, A, 32 )
LVX( vp2, 0, tptr )
ADDI( tptr, tptr, 16 )
SLWI( TCA4, TCA, 2 )
LVX( vp3, 0, tptr )
BLE( cont )
ANDI_C( temp, temp, 32 )
BR( cont )

/**
  outer loop transposes 2 (or 1 at end) 32 x 32 tiles per trip
**/

LABEL( outer_loop )
/* { */

CMPWI( NC_left, 32 )

LABEL( cont )

ADD( dstp, A, TCA4 ) /* start prefetch advanced */
MR( aptr0, A )
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g 
ADD( dstp, dstp, TCA ) /* advanced further */
LI( aindx0, 0 )
ADD( aptr1, aptr0, TCA )
LI( aindx1, 16 )
ADD( aptr2, aptr1, TCA )
MR( cptr0, L1_cache )
ADD( aptr3, aptr2, TCA )
ADDI( cptr1, cptr0, 512 )
LI( cindx0, 0 )
LOAD_INPUT( a00, aptr0, aindx0 ) /*** begins next sequence ***/
LI( cindx1, 128 )
LOAD_INPUT( a10, aptr1, aindx0 )
LI( cindx2, 256 )
LOAD_INPUT( a20, aptr2, aindx0 )
LI( cindx3, 384 )
LOAD_INPUT( a30, aptr3, aindx0 )
MR( icount, NR )
BLE( input_loop_do1 )

LI( aindx2, 32 ) /* these are used only in two ti… */
LOAD_INPUT( a02, aptr0, aindx2 )
LI( aindx3, 48 )
LOAD_INPUT( a12, aptr1, aindx2 )
ADDI( cptr2, cptr1, 512 )
LOAD_INPUT( a22, aptr2, aindx2 )
ADDI( cptr3, cptr2, 512 )
LOAD_INPUT( a32, aptr3, aindx2 )



/**
  Top of input loop processes a 4 x 64 byte tile each trip
**/

LABEL( input_loop_do2 )
/* { */

PREFETCH( dstp, dst_code, 0, TCA4 )
ADDIC_C( icount, icount, -4 )

VMRGHW( vt0, a00, a20 ) /* vt0 = a00[0-3] a20[0-3] a00[4-7] a20[4-7] */
LOAD_INPUT( a01, aptr0, aindx1 )
VMRGLW( vt2, a00, a20 ) /* vt2 = a00[8-b] a20[8-b] a00[c-f] a20[c-f] */
LOAD_INPUT( a11, aptr1, aindx1 )
VMRGHW( vt1, a10, a30 ) /* vt1 = a10[0-3] a30[0-3] a10[4-7] a30[4-7] */
LOAD_INPUT( a21, aptr2, aindx1 )
VMRGLW( vt3, a10, a30 ) /* vt3 = a10[8-b] a30[8-b] a10[c-f] a30[c-f] */
LOAD_INPUT( a31, aptr3, aindx1 )

VMRGHW( c00, vt0, vt1 ) /* c00 = a00[0-3] a10[0-3] a20[0-3] a30[0-3] */
STORE_CACHE( c00, cptr0, cindx0 )
VMRGLW( c01, vt0, vt1 ) /* c01 = a00[4-7] a10[4-7] a20[4-7] a30[4-7] */
STORE_CACHE( c01, cptr0, cindx1 )
VMRGHW( c02, vt2, vt3 ) /* c02 = a00[8-b] a10[8-b] a20[8-b] a30[8-b] */
STORE_CACHE( c02, cptr0, cindx2 )
VMRGLW( c03, vt2, vt3 ) /* c03 = a00[c-f] a10[c-f] a20[c-f] a30[c-f] */
STORE_CACHE( c03, cptr0, cindx3 )

VMRGHW( vt0, a01, a21 ) /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
LOAD_INPUT( a03, aptr0, aindx3 )
VMRGLW( vt2, a01, a21 ) /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
LOAD_INPUT( a13, aptr1, aindx3 )
VMRGHW( vt1, a11, a31 ) /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
LOAD_INPUT( a23, aptr2, aindx3 )
VMRGLW( vt3, a11, a31 ) /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
LOAD_INPUT( a33, aptr3, aindx3 )

VMRGHW( c10, vt0, vt1 ) /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW( c11, vt0, vt1 ) /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )

VMRGHW( c12, vt2, vt3 ) /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW( c13, vt2, vt3 ) /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )

BLE( flush_input_loop_do2 )

ADD( aindx0, aindx0, TCA4 ) /* bump for next load sequence */
ADD( aindx1, aindx1, TCA4 )
ADD( aindx2, aindx2, TCA4 )
ADD( aindx3, aindx3, TCA4 )

VMRGHW( vt0, a02, a22 ) /* vt0 = a02[0-3] a22[0-3] a02[4-7] a22[4-7] */
LOAD_INPUT( a00, aptr0, aindx0 ) /*** begins next sequence ***/
VMRGLW( vt2, a02, a22 ) /* vt2 = a02[8-b] a22[8-b] a02[c-f] a22[c-f] */
LOAD_INPUT( a02, aptr0, aindx2 )
VMRGHW( vt1, a12, a32 ) /* vt1 = a12[0-3] a32[0-3] a12[4-7] a32[4-7] */
LOAD_INPUT( a10, aptr1, aindx0 )
VMRGLW( vt3, a12, a32 ) /* vt3 = a12[8-b] a32[8-b] a12[c-f] a32[c-f] */
LOAD_INPUT( a12, aptr1, aindx2 )

VMRGHW( c20, vt0, vt1 ) /* c20 = a02[0-3] a12[0-3] a22[0-3] a32[0-3] */
STORE_CACHE( c20, cptr2, cindx0 )
VMRGLW( c21, vt0, vt1 ) /* c21 = a02[4-7] a12[4-7] a22[4-7] a32[4-7] */
STORE_CACHE( c21, cptr2, cindx1 )
VMRGHW( c22, vt2, vt3 ) /* c22 = a02[8-b] a12[8-b] a22[8-b] a32[8-b] */
STORE_CACHE( c22, cptr2, cindx2 )
VMRGLW( c23, vt2, vt3 ) /* c23 = a02[c-f] a12[c-f] a22[c-f] a32[c-f] */
STORE_CACHE( c23, cptr2, cindx3 )

VMRGHW( vt0, a03, a23 ) /* vt0 = a03[0-3] a23[0-3] a03[4-7] a23[4-7] */
LOAD_INPUT( a20, aptr2, aindx0 )
VMRGLW( vt2, a03, a23 ) /* vt2 = a03[8-b] a23[8-b] a03[c-f] a23[c-f] */
LOAD_INPUT( a22, aptr2, aindx2 )
VMRGHW( vt1, a13, a33 ) /* vt1 = a13[0-3] a33[0-3] a13[4-7] a33[4-7] */
LOAD_INPUT( a30, aptr3, aindx0 )
VMRGLW( vt3, a13, a33 ) /* vt3 = a13[8-b] a33[8-b] a13[c-f] a33[c-f] */
LOAD_INPUT( a32, aptr3, aindx2 )

VMRGHW( c30, vt0, vt1 ) /* c30 = a03[0-3] a13[0-3] a23[0-3] a33[0-3] */
STORE_CACHE( c30, cptr3, cindx0 )
VMRGLW( c31, vt0, vt1 ) /* c31 = a03[4-7] a13[4-7] a23[4-7] a33[4-7] */
STORE_CACHE( c31, cptr3, cindx1 )
VMRGHW( c32, vt2, vt3 ) /* c32 = a03[8-b] a13[8-b] a23[8-b] a33[8-b] */
STORE_CACHE( c32, cptr3, cindx2 )
VMRGLW( c33, vt2, vt3 ) /* c33 = a03[c-f] a13[c-f] a23[c-f] a33[c-f] */
STORE_CACHE( c33, cptr3, cindx3 )

ADDI( cindx0, cindx0, 16 ) /* bump for next store sequence */
ADDI( cindx1, cindx1, 16 )
ADDI( cindx2, cindx2, 16 )
ADDI( cindx3, cindx3, 16 )

BR( input_loop_do2 )

LABEL( flush_input_loop_do2 )

VMRGHW( vt0, a02, a22 )   /* vt0 = a02[0-3] a22[0-3] a02[4-7] a22[4-7] */
VMRGLW( vt2, a02, a22 )   /* vt2 = a02[8-b] a22[8-b] a02[c-f] a22[c-f] */
VMRGHW( vt1, a12, a32 )   /* vt1 = a12[0-3] a32[0-3] a12[4-7] a32[4-7] */
VMRGLW( vt3, a12, a32 )   /* vt3 = a12[8-b] a32[8-b] a12[c-f] a32[c-f] */

VMRGHW( c20, vt0, vt1 )   /* c20 = a02[0-3] a12[0-3] a22[0-3] a32[0-3] */
STORE_CACHE( c20, cptr2, cindx0 )

VMRGLW( c21, vt0, vt1 )   /* c21 = a02[4-7] a12[4-7] a22[4-7] a32[4-7] */
STORE_CACHE( c21, cptr2, cindx1 )

VMRGHW( c22, vt2, vt3 )   /* c22 = a02[8-b] a12[8-b] a22[8-b] a32[8-b] */
STORE_CACHE( c22, cptr2, cindx2 )

VMRGLW( c23, vt2, vt3 )   /* c23 = a02[c-f] a12[c-f] a22[c-f] a32[c-f] */
STORE_CACHE( c23, cptr2, cindx3 )

VMRGHW( vt0, a03, a23 )   /* vt0 = a03[0-3] a23[0-3] a03[4-7] a23[4-7] */
VMRGLW( vt2, a03, a23 )   /* vt2 = a03[8-b] a23[8-b] a03[c-f] a23[c-f] */
VMRGHW( vt1, a13, a33 )   /* vt1 = a13[0-3] a33[0-3] a13[4-7] a33[4-7] */
VMRGLW( vt3, a13, a33 )   /* vt3 = a13[8-b] a33[8-b] a13[c-f] a33[c-f] */

VMRGHW( c30, vt0, vt1 )   /* c30 = a03[0-3] a13[0-3] a23[0-3] a33[0-3] */
STORE_CACHE( c30, cptr3, cindx0 )

VMRGLW( c31, vt0, vt1 )   /* c31 = a03[4-7] a13[4-7] a23[4-7] a33[4-7] */
STORE_CACHE( c31, cptr3, cindx1 )

VMRGHW( c32, vt2, vt3 )   /* c32 = a03[8-b] a13[8-b] a23[8-b] a33[8-b] */
STORE_CACHE( c32, cptr3, cindx2 )

VMRGLW( c33, vt2, vt3 )   /* c33 = a03[c-f] a13[c-f] a23[c-f] a33[c-f] */
STORE_CACHE( c33, cptr3, cindx3 )

MR( outptr0, C )          /* set for output loop in current pass */
SLWI( Cbump, TCC, 6 )
ADDI( A, A, 64 )
ADD( C, C, Cbump )        /* bump C for next pass */
LI( icount, 64 )          /* set icount for 2 tiles */
BR( output_start )        /* join to common output loop */



Top < 



/**
 * input loop processes a 4 x 32 byte tile each trip
 **/
LABEL( input_loop_do1 )
/* { */

PREFETCH( dstp, dst_code, 0, TCA4 )
ADDIC_C( icount, icount, -4 )

VMRGHW( vt0, a00, a20 )   /* vt0 = a00[0-3] a20[0-3] a00[4-7] a20[4-7] */
LOAD_INPUT( a01, aptr0, aindx1 )
VMRGLW( vt2, a00, a20 )   /* vt2 = a00[8-b] a20[8-b] a00[c-f] a20[c-f] */
LOAD_INPUT( a11, aptr1, aindx1 )
VMRGHW( vt1, a10, a30 )   /* vt1 = a10[0-3] a30[0-3] a10[4-7] a30[4-7] */
LOAD_INPUT( a21, aptr2, aindx1 )
VMRGLW( vt3, a10, a30 )   /* vt3 = a10[8-b] a30[8-b] a10[c-f] a30[c-f] */
LOAD_INPUT( a31, aptr3, aindx1 )

VMRGHW( c00, vt0, vt1 )   /* c00 = a00[0-3] a10[0-3] a20[0-3] a30[0-3] */
STORE_CACHE( c00, cptr0, cindx0 )
VMRGLW( c01, vt0, vt1 )   /* c01 = a00[4-7] a10[4-7] a20[4-7] a30[4-7] */
STORE_CACHE( c01, cptr0, cindx1 )
VMRGHW( c02, vt2, vt3 )   /* c02 = a00[8-b] a10[8-b] a20[8-b] a30[8-b] */
STORE_CACHE( c02, cptr0, cindx2 )
VMRGLW( c03, vt2, vt3 )   /* c03 = a00[c-f] a10[c-f] a20[c-f] a30[c-f] */
STORE_CACHE( c03, cptr0, cindx3 )

BLE( flush_input_loop_do1 )

ADD( aindx0, aindx0, TCA4 )   /* bump for next load sequence */
ADD( aindx1, aindx1, TCA4 )

VMRGHW( vt0, a01, a21 )   /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
LOAD_INPUT( a00, aptr0, aindx0 )   /*** begins next sequence ***/
VMRGLW( vt2, a01, a21 )   /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
LOAD_INPUT( a10, aptr1, aindx0 )
VMRGHW( vt1, a11, a31 )   /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
LOAD_INPUT( a20, aptr2, aindx0 )
VMRGLW( vt3, a11, a31 )   /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */
LOAD_INPUT( a30, aptr3, aindx0 )

VMRGHW( c10, vt0, vt1 )   /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW( c11, vt0, vt1 )   /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )
VMRGHW( c12, vt2, vt3 )   /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW( c13, vt2, vt3 )   /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )

ADDI( cindx0, cindx0, 16 )   /* bump for next store sequence */
ADDI( cindx1, cindx1, 16 )
ADDI( cindx2, cindx2, 16 )
ADDI( cindx3, cindx3, 16 )

BR( input_loop_do1 )

LABEL( flush_input_loop_do1 )

VMRGHW( vt0, a01, a21 )   /* vt0 = a01[0-3] a21[0-3] a01[4-7] a21[4-7] */
VMRGLW( vt2, a01, a21 )   /* vt2 = a01[8-b] a21[8-b] a01[c-f] a21[c-f] */
VMRGHW( vt1, a11, a31 )   /* vt1 = a11[0-3] a31[0-3] a11[4-7] a31[4-7] */
VMRGLW( vt3, a11, a31 )   /* vt3 = a11[8-b] a31[8-b] a11[c-f] a31[c-f] */

VMRGHW( c10, vt0, vt1 )   /* c10 = a01[0-3] a11[0-3] a21[0-3] a31[0-3] */
STORE_CACHE( c10, cptr1, cindx0 )
VMRGLW( c11, vt0, vt1 )   /* c11 = a01[4-7] a11[4-7] a21[4-7] a31[4-7] */
STORE_CACHE( c11, cptr1, cindx1 )
VMRGHW( c12, vt2, vt3 )   /* c12 = a01[8-b] a11[8-b] a21[8-b] a31[8-b] */
STORE_CACHE( c12, cptr1, cindx2 )
VMRGLW( c13, vt2, vt3 )   /* c13 = a01[c-f] a11[c-f] a21[c-f] a31[c-f] */
STORE_CACHE( c13, cptr1, cindx3 )

MR( outptr0, C )          /* set for output loop in current pass */
SLWI( Cbump, TCC, 5 )
ADDI( A, A, 32 )
ADD( C, C, Cbump )        /* bump C for next pass */
LI( icount, 32 )          /* set icount for 1 tile */
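
The first-stage merges can be hard to follow from the register comments alone. The sketch below emulates the word-merge pattern in portable C for a single 4 x 4 block of 32-bit words. The type `vec4` and the functions `vmrghw`, `vmrglw` and `transpose4x4` are hypothetical stand-ins written for illustration, not part of the listing; only the merge semantics (high merge interleaves words 0 and 1, low merge interleaves words 2 and 3) are taken from the AltiVec instruction definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one 16-byte vector register holding four words. */
typedef struct { uint32_t w[4]; } vec4;

/* Emulates vmrghw: interleave words 0 and 1 of a and b. */
static vec4 vmrghw(vec4 a, vec4 b) {
    vec4 r = {{ a.w[0], b.w[0], a.w[1], b.w[1] }};
    return r;
}

/* Emulates vmrglw: interleave words 2 and 3 of a and b. */
static vec4 vmrglw(vec4 a, vec4 b) {
    vec4 r = {{ a.w[2], b.w[2], a.w[3], b.w[3] }};
    return r;
}

/* Transposes a 4 x 4 block of words with two rounds of merges,
 * following the same pairing (rows 0/2 and rows 1/3) as the listing. */
static void transpose4x4(const vec4 in[4], vec4 out[4]) {
    vec4 vt0 = vmrghw(in[0], in[2]);   /* r0[0] r2[0] r0[1] r2[1] */
    vec4 vt2 = vmrglw(in[0], in[2]);   /* r0[2] r2[2] r0[3] r2[3] */
    vec4 vt1 = vmrghw(in[1], in[3]);   /* r1[0] r3[0] r1[1] r3[1] */
    vec4 vt3 = vmrglw(in[1], in[3]);   /* r1[2] r3[2] r1[3] r3[3] */

    out[0] = vmrghw(vt0, vt1);         /* r0[0] r1[0] r2[0] r3[0] */
    out[1] = vmrglw(vt0, vt1);         /* r0[1] r1[1] r2[1] r3[1] */
    out[2] = vmrghw(vt2, vt3);         /* r0[2] r1[2] r2[2] r3[2] */
    out[3] = vmrglw(vt2, vt3);         /* r0[3] r1[3] r2[3] r3[3] */
}
```

Calling `transpose4x4(in, out)` on four row vectors yields four column vectors, which corresponds to the c00..c03 (or c10..c13) values that the listing stores to the cache buffer.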



/**
 * Second stage of transposition, write output
 **/
LABEL( output_start )

CMPW_CR( 6, icount, NC_left )
MR( cptr, L1_cache )
SLWI( TCC4, TCC, 2 )
LI( cindx0, 0 )
LI( cindx1, 16 )
LI( cindx2, 2*16 )
LI( cindx3, 3*16 )
LI( cindx4, 4*16 )
LI( cindx5, 5*16 )
LI( cindx6, 6*16 )
BLE_CR( 6, PC_OFFSET( 8 ) )   /* branch if icount < NC_left */
MR( icount, NC_left )
LI( cindx7, 7*16 )
SUB( NC_left, NC_left, icount )
ADDIC_C( icount, icount, -4 )
LI( out_indx0, 0 )
LOAD_CACHE( c0, cptr, cindx0 )
ADD( out_indx1, out_indx0, TCC )
LOAD_CACHE( c1, cptr, cindx1 )
ADD( out_indx2, out_indx1, TCC )
LOAD_CACHE( c2, cptr, cindx2 )
ADD( out_indx3, out_indx2, TCC )
LOAD_CACHE( c3, cptr, cindx3 )
ADDI( outptr1, outptr0, 16 )
LOAD_CACHE( c4, cptr, cindx4 )
VPERM( vt0, c0, c1, vp0 )
LOAD_CACHE( c5, cptr, cindx5 )
VPERM( vt1, c0, c1, vp1 )
LOAD_CACHE( c6, cptr, cindx6 )
VPERM( vt2, c2, c3, vp0 )
LOAD_CACHE( c7, cptr, cindx7 )
VPERM( vt3, c2, c3, vp1 )
ADDI( cptr, cptr, 128 )
BR( output_mloop )

/**
 * Loop outputs four 32 byte rows
 **/
LABEL( output_loop )

ADDIC_C( icount, icount, -4 )
ADDI( cptr, cptr, 128 )
STORE_OUTPUT( out0, outptr0, out_indx0 )
VPERM( out4, vt4, vt6, vp2 )
STORE_OUTPUT( out4, outptr1, out_indx0 )
VPERM( out5, vt4, vt6, vp3 )
STORE_OUTPUT( out1, outptr0, out_indx1 )
VPERM( out6, vt5, vt7, vp2 )
STORE_OUTPUT( out5, outptr1, out_indx1 )
VPERM( out7, vt5, vt7, vp3 )
STORE_OUTPUT( out2, outptr0, out_indx2 )
VPERM( vt0, c0, c1, vp0 )
STORE_OUTPUT( out6, outptr1, out_indx2 )
VPERM( vt1, c0, c1, vp1 )
STORE_OUTPUT( out3, outptr0, out_indx3 )
VPERM( vt2, c2, c3, vp0 )
STORE_OUTPUT( out7, outptr1, out_indx3 )
VPERM( vt3, c2, c3, vp1 )
ADD( outptr0, outptr0, TCC4 )
ADD( outptr1, outptr1, TCC4 )

LABEL( output_mloop )

BLE( flush_output_loop )
LOAD_CACHE( c0, cptr, cindx0 )
VPERM( vt4, c4, c5, vp0 )
LOAD_CACHE( c1, cptr, cindx1 )
VPERM( vt5, c4, c5, vp1 )
LOAD_CACHE( c2, cptr, cindx2 )
VPERM( vt6, c6, c7, vp0 )
LOAD_CACHE( c3, cptr, cindx3 )
VPERM( vt7, c6, c7, vp1 )
LOAD_CACHE( c4, cptr, cindx4 )
VPERM( out0, vt0, vt2, vp2 )
LOAD_CACHE( c5, cptr, cindx5 )
VPERM( out1, vt0, vt2, vp3 )
LOAD_CACHE( c6, cptr, cindx6 )
VPERM( out2, vt1, vt3, vp2 )
LOAD_CACHE( c7, cptr, cindx7 )
VPERM( out3, vt1, vt3, vp3 )
BR( output_loop )

LABEL( flush_output_loop )

VPERM( vt4, c4, c5, vp0 )
VPERM( vt5, c4, c5, vp1 )
VPERM( vt6, c6, c7, vp0 )
VPERM( vt7, c6, c7, vp1 )

CMPWI( icount, -3 )
VPERM( out0, vt0, vt2, vp2 )
STORE_OUTPUT( out0, outptr0, out_indx0 )
VPERM( out4, vt4, vt6, vp2 )
STORE_OUTPUT( out4, outptr1, out_indx0 )
BEQ( oloop_next )

CMPWI( icount, -2 )
VPERM( out1, vt0, vt2, vp3 )
STORE_OUTPUT( out1, outptr0, out_indx1 )
VPERM( out5, vt4, vt6, vp3 )
STORE_OUTPUT( out5, outptr1, out_indx1 )
BEQ( oloop_next )

CMPWI( icount, -1 )
VPERM( out2, vt1, vt3, vp2 )
STORE_OUTPUT( out2, outptr0, out_indx2 )
VPERM( out6, vt5, vt7, vp2 )
STORE_OUTPUT( out6, outptr1, out_indx2 )
BEQ( oloop_next )

VPERM( out3, vt1, vt3, vp3 )
STORE_OUTPUT( out3, outptr0, out_indx3 )
VPERM( out7, vt5, vt7, vp3 )
STORE_OUTPUT( out7, outptr1, out_indx3 )

/**
 * Next four rows of C?
 **/

/**
 * Exit routine
 **/
LABEL( ret )

FREE_THRU_v31( VRSAVE_COND )
REST_r13_r28
RETURN

FUNC_EPILOG
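
The second stage relies on VPERM with the permute vectors vp0 through vp3, whose contents are defined outside this part of the listing. As a rough aid, the following C sketch emulates the byte-selection rule of the AltiVec `vperm` instruction: each result byte is picked from the 32-byte concatenation of the two source vectors using the low five bits of the corresponding control byte. The function name `vperm_bytes` and the interleaving mask used in the usage note are illustrative assumptions, not the masks used by the routine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Emulates vperm on 16-byte vectors.
 * out[i] = (a || b)[sel[i] & 0x1f], where a supplies bytes 0..15
 * and b supplies bytes 16..31 of the concatenation. */
static void vperm_bytes(uint8_t out[16], const uint8_t a[16],
                        const uint8_t b[16], const uint8_t sel[16]) {
    uint8_t tmp[16];                     /* allows out to alias a or b */
    for (int i = 0; i < 16; i++) {
        uint8_t idx = sel[i] & 0x1f;     /* only the low 5 bits are used */
        tmp[i] = (idx < 16) ? a[idx] : b[idx - 16];
    }
    memcpy(out, tmp, sizeof tmp);
}
```

With a hypothetical control vector such as {0, 16, 1, 17, ..., 7, 23}, the result interleaves the first eight bytes of `a` with the first eight bytes of `b`. Two-level shuffles of that kind are one way a pair of VPERM passes could reorder 4-byte groups into single-byte transposed order.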
mtriangle_8bit.mac

MC Standard Algorithms -- PPC Macro language Version

File Name:      mtriangle_8bit.mac

Description:    Move from an upper triangular matrix stored
                as a series of 32-line rectangles, each of
                width 32 elements less than its immediate
                predecessor, to the upper triangle of a
                full N x N matrix.

                mtriangle_8bit ( char *A, char *C, int N )

Restrictions:   A and C must both be 16-byte aligned.
                N must be a multiple of 16 and >= 16.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000605   jg         Created

#include "salppc.inc"

#define LOAD_A( vT, rA, rB )     LVXL( vT, rA, rB )
#define LOAD_C( vT, rA, rB )     LVX( vT, rA, rB )
#define STORE_C( vS, rA, rB )    STVX( vS, rA, rB )

#define R_MATRIX_ALIGN_LOG       5
#define R_MATRIX_ALIGN           (1 << R_MATRIX_ALIGN_LOG)
#define R_MATRIX_ALIGN_MASK      (R_MATRIX_ALIGN - 1)

#define ALTIVEC_ALIGN_LOG        4
#define ALTIVEC_ALIGN            (1 << ALTIVEC_ALIGN_LOG)
#define ALTIVEC_ALIGN_MASK       (ALTIVEC_ALIGN - 1)

/**
 * Input parameters
 **/
#define A
#define C
#define N

#define A_tcols
#define C_tcols
#define altivec_N
#define A_counter
#define index0
#define index1
#define index2
#define index3
#define count

#define a0
#define a1
#define a2
#define a3
#define c0
#define shift
#define shift_incr
#define mask
#define left
#define right

r10
r11
r12
r13

FUNC_PROLOG

ENTRY_3( mtriangle_8bit, A, C, N )

ADDI( A_counter, N, R_MATRIX_ALIGN_MASK )
VSPLTISW( shift_incr, 8 )
ADDI( altivec_N, N, ALTIVEC_ALIGN_MASK )
VXOR( shift, shift, shift )
RLWINM( A_counter, A_counter, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
RLWINM( altivec_N, altivec_N, 0, 0, (31 - ALTIVEC_ALIGN_LOG) )
ADDI( C_tcols, A_counter, 1 )

LABEL( oloop )

ADDIC_C( count, altivec_N, -64 )
LOAD_C( c0, 0, C )
VSPLTISW( mask, -1 )
LOAD_A( a0, 0, A )
VSRO( mask, mask, shift )
LI( index0, 16 )
VANDC( left, c0, mask )
LI( index1, 32 )
VAND( right, a0, mask )
LI( index2, 48 )
VOR( c0, left, right )
STORE_C( c0, 0, C )
BLE( dosmall )
LI( index3, 64 )

LABEL( iloop )

LOAD_A( a0, A, index0 )
ADDIC_C( count, count, -64 )
LOAD_A( a1, A, index1 )
LOAD_A( a2, A, index2 )
LOAD_A( a3, A, index3 )
STORE_C( a0, C, index0 )
ADDI( index0, index0, 64 )
STORE_C( a1, C, index1 )
ADDI( index1, index1, 64 )
STORE_C( a2, C, index2 )
ADDI( index2, index2, 64 )
STORE_C( a3, C, index3 )
ADDI( index3, index3, 64 )
BGT( iloop )

LABEL( dosmall )

ADDIC_C( count, count, 48 )
BLE( windout )

LABEL( sloop )

ADDIC_C( count, count, -16 )
LOAD_A( a0, A, index0 )
STORE_C( a0, C, index0 )
ADDI( index0, index0, 16 )
BGT( sloop )

LABEL( windout )

DECR_C( N )
VADDUWM( shift, shift, shift_incr )
ADDI( A_counter, A_counter, -1 )
ADDI( A, A, 1 )
ADDI( A_tcols, A_counter, R_MATRIX_ALIGN_MASK )
DECR( altivec_N )
RLWINM( A_tcols, A_tcols, 0, 0, (31 - R_MATRIX_ALIGN_LOG) )
ADD( C, C, C_tcols )
ADD( A, A, A_tcols )
BNE( oloop )

FREE_THRU_v9( VRSAVE_COND )
REST_r13
RETURN

FUNC_EPILOG
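
For readers who want a scalar model of the data movement, here is a minimal C sketch of what the routine appears to compute, reconstructed from the file description and the pointer arithmetic in the listing. It assumes that rectangle k of the packed input is 32 rows high and W - 32*k bytes wide, where W is N rounded up to a multiple of 32, and that the full matrix C has a row stride of W bytes. The name `mtriangle_8bit_ref` and the helper `round_up32` are hypothetical, and the sketch copies exactly N - i bytes per row rather than whole 16-byte vectors as the AltiVec loop does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rounds n up to the next multiple of 32. */
static size_t round_up32(size_t n) {
    return (n + 31u) & ~(size_t)31u;
}

/* Scalar reference under the assumed layout: A holds the upper triangle
 * as consecutive 32-row rectangles, each 32 bytes narrower than the last;
 * C is a full matrix with row stride W = round_up32(N). */
static void mtriangle_8bit_ref(const char *A, char *C, size_t N) {
    size_t W = round_up32(N);
    const char *rect = A;                    /* start of current rectangle */
    for (size_t i = 0; i < N; i++) {
        size_t k = i / 32;                   /* rectangle index */
        size_t r = i % 32;                   /* row inside the rectangle */
        size_t width = W - 32 * k;           /* row width of rectangle k */
        if (r == 0 && i != 0)
            rect += 32 * (width + 32);       /* skip the previous rectangle */
        /* The diagonal element of row i sits at column r of rectangle row r. */
        memcpy(C + i * W + i, rect + r * width + r, N - i);
    }
}
```

The lower triangle of C is left untouched, which matches the way the listing merges the first vector of each row under a mask so that bytes to the left of the diagonal are preserved.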






#if !defined( SALPPC_H )
#define SALPPC_H

MC Standard Algorithms -- PPC Version


File Name: 
Description: 

Source files should have extension .mac. For example, vadd.mac 
and must include this file (salppc.h). 

To assemble for PPC ucode, use the following basic 
makefile build rule: 



ccmc -c 
rm -f $' 
rm -f $' 



$*.s -E $*.c 
-o $*.o $*.s 



To compile for C, use the following basic makefile build rule: 
.SUFFIXES: .mac .c .o 



rm -f $*.c

The first 8 function arguments are passed in GPR registers
r3 - r10. Arguments beyond 8 are passed on the stack and may
be obtained with the GET_ARG8, GET_ARG9, ... GET_ARG15 macros.
Additional GPR registers should be assigned in ascending order
starting from the last function argument. These may be declared
with the DECLARE_rx[_ry] macros. For example, a function with
5 arguments that requires 3 additional GPR registers would
issue: DECLARE_r8_r10. r0, if required, should be declared
separately with the DECLARE_r0 macro. GPR registers above r12
must be saved and restored using the SAVE_r13[_ry] and
REST_r13[_ry] macros, respectively.

FPR registers should be assigned in ascending order starting
with f0[d0]. These may be declared with the DECLARE_f0[_fy]
or DECLARE_d0[_dy] macros.

For example, DECLARE_f0_f11. FPR registers above f13[d13] must
be saved and restored using the SAVE_f14[_fy] and REST_f14[_fy]
or SAVE_d14[_dy] and REST_d14[_dy] macros, respectively.

All variables must be assigned a register using the
pre-processor #define directive. GPR registers are named
r0 - r31. Single precision FPR registers are named f0 - f31.
Double precision FPR registers are named d0 - d31. Different
variables may be assigned to the same register as in:

#define vara f12
#define varb f12

Functions must begin with the FUNC_PROLOG macro and end
with the FUNC_EPILOG macro.
Macros are provided for both Fortran and C entry points.

The GET_SAL_CACHE macro should be used to get the address of
the "current" salcache buffer into a GPR register.

Avoid terminating macro lines with a semicolon.

The following example demonstrates typical usage:

#include "salppc.h"

/*
 * assign variables to registers
 */
#define A r3
#define I r4
#define B r5
#define J r6
#define C r7
#define K r8
#define D r9
#define L r10
#define N r12
#define EFLAG r11
#define count r11

#define t0 r13
#define t1 r13
#define t2 r14
#define t3 r14
#define t4 r15
#define t5 r15
#define t6 r16

#define a0 f0
#define a1 f1
#define a2 f2
#define a3 f3
#define b0 f4
#define b1 f5
#define b2 f6
#define b3 f7
#define c0 f8
#define c1 f9
#define c2 f10
#define c3 f11
#define d0 f12
#define d1 f13
#define d2 f14
#define d3 f15

FUNC_PROLOG                     /* must precede function */

#if !defined( COMPILE_C )
U_ENTRY( foo_ )
FORTRAN_DREF_4( I, J, K, L )
FORTRAN_DREF_ARG8

U_ENTRY( foo )
LI( EFLAG, 0 )
BR( common )

U_ENTRY( foo_x_ )
FORTRAN_DREF_4( I, J, K, L )
FORTRAN_DREF_ARG8
FORTRAN_DREF_ARG9
#endif

ENTRY_10( foo_x, A, I, B, J, C, K, D, L, N, EFLAG )
DECLARE_r13_r16
DECLARE_f0_f15
GET_ARG9( EFLAG )               /* get the 9'th arg (EFLAG) off stack */

LABEL( common )

SAVE_CR                         /* needed if using fields 2,3 or 4 */
SAVE_r13_r16
SAVE_f14_f15
SAVE_LR                         /* needed if making a function call */
GET_ARG8( N )                   /* get the 8'th arg (N) off stack */

/* body of function ... */

REST_CR
REST_r13_r16
REST_f14_f15
REST_LR
RETURN

FUNC_EPILOG                     /* must conclude function */



Mercury Computer Systems, Inc. 
Copyright (c) 1996 All rights reserved 

Date Engineer; Reason 



0.2 



970521 
980813 



0.3 
0.4 
#endif 
#include • 



#define uchar unsigned char
#define ulong unsigned long 
#define ushort unsigned short 



jg; Created 

jfk; Added POSTING BUFFER COUNT and made 

TEST IF DCBZ macro time "stw" instead 
of doing the TEST IF DCBT macro (lwz) 

jfk; Added SALCACHE ALLOC SIZE , 

ALIGN SALCACHE, CREATE_SALCACHE_FRAME 
DESTROY SALCACHE FRAME 

jfk; Added SET DCB [TZ] COND macros. 
Made old macros not assemble 

jfk; Changes SALCACHE ALLOC SIZE for 750 

/* header */ 



#define CR _c 
#define CTR 
#define VSCR 



/*
 * define a structure to represent a VMX register
 */
typedef union {
    char   c[16];
    uchar  uc[16];
    short  s[8];
    ushort us[8];
    long   l[4];
    ulong  ul[4];
    float  f[4];
} VMX_reg;
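
To illustrate how a union of this shape lets C code emulate a 128-bit vector register lane by lane, here is a small sketch. It uses fixed-width integer types instead of `long`, because `long` is 8 bytes on many current platforms and would make the union larger than 16 bytes; that substitution, the type name `vmx_reg_sketch` and the helper `add_bytes` are assumptions made for this example and do not appear in the header.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-width variant of the VMX register union. */
typedef union {
    int8_t   c[16];
    uint8_t  uc[16];
    int16_t  s[8];
    uint16_t us[8];
    int32_t  l[4];
    uint32_t ul[4];
    float    f[4];
} vmx_reg_sketch;

/* Byte-wise modulo-256 add of two emulated registers, in the spirit of
 * an unsigned byte vector add. */
static vmx_reg_sketch add_bytes(vmx_reg_sketch a, vmx_reg_sketch b) {
    vmx_reg_sketch r;
    for (int i = 0; i < 16; i++)
        r.uc[i] = (uint8_t)(a.uc[i] + b.uc[i]);
    return r;
}
```

The same 16 bytes can then be read back through any of the other members, for example as four 32-bit words, which is how a C build of a macro routine could mimic instructions that operate on different element widths.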

#define FUNC_PROLOG
#define FUNC_EPILOG \
}

#define TEXT_SECTION( logb2_align )
#define DATA_SECTION( logb2_align )
#define RODATA_SECTION( logb2_align )

/*
 * macro for C extern declarations
 */
#define EXTERN_DATA( symbol ) \
    extern long symbol;

#define EXTERN_FUNC( func ) \
    extern void func( void );

/*
 * macro for a global declaration
 */
#define GLOBAL( symbol )

/*
 * macro for a local declaration
 */
#define LOCAL( symbol )

/*
 * macros for creating static arrays
 */
#define START_ARRAY( type, name ) \
    type name##[] = {

#define START_C_ARRAY( name )  START_ARRAY( char, name )
#define START_UC_ARRAY( name ) START_ARRAY( uchar, name )
#define START_S_ARRAY( name )  START_ARRAY( short, name )
#define START_US_ARRAY( name ) START_ARRAY( ushort, name )
#define START_L_ARRAY( name )  START_ARRAY( long, name )
#define START_UL_ARRAY( name ) START_ARRAY( ulong, name )
#define START_F_ARRAY( name )  START_ARRAY( float, name )

#define END_ARRAY \
    };

#define DATA( d1 ) \
    d1,

#define DATA2( d1, d2 ) \
    d1, d2,

#define DATA4( d1, d2, d3, d4 ) \
    d1, d2, d3, d4,

#define DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    d1, d2, d3, d4, d5, d6, d7, d8,

#define C_DATA( d1 )  DATA( d1 )
#define UC_DATA( d1 ) DATA( d1 )
#define S_DATA( d1 )  DATA( d1 )
#define US_DATA( d1 ) DATA( d1 )
#define L_DATA( d1 )  DATA( d1 )
#define UL_DATA( d1 ) DATA( d1 )
#define F_DATA( d1 )  DATA( d1 )

#if defined( LITTLE_ENDIAN )
#define D_DATA( d1, d2 ) DATA2( d2, d1 )
#else
#define D_DATA( d1, d2 ) DATA2( d1, d2 )
#endif
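
As a rough illustration of how the array macros are meant to combine in the C build, the sketch below defines simplified versions and uses them to declare a static table. The simplified `START_ARRAY` drops the `##` paste so that it compiles with a standard C preprocessor, and `END_ARRAY` is assumed to expand to `};`; both are assumptions for this example, as is the table name `example_table`.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, standard-C variants of the array macros (assumed forms). */
#define START_ARRAY( type, name )   type name[] = {
#define END_ARRAY                   };
#define DATA4( d1, d2, d3, d4 )     d1, d2, d3, d4,

#define START_L_ARRAY( name )       START_ARRAY( long, name )
#define L_DATA4( d1, d2, d3, d4 )   DATA4( d1, d2, d3, d4 )

/* Expands to: static long example_table[] = { 10, 20, ..., 80, }; */
static START_L_ARRAY( example_table )
    L_DATA4( 10, 20, 30, 40 )
    L_DATA4( 50, 60, 70, 80 )
END_ARRAY

/* Number of elements in the generated table. */
static size_t example_table_len(void) {
    return sizeof example_table / sizeof example_table[0];
}
```

Because each DATA macro ends its expansion with a comma, rows can be stacked without separators, and the trailing comma before the closing brace is legal in a C initializer. None of the macro lines ends in a semicolon, in line with the header's advice.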



#define C_DATA2( d1, d2 )  DATA2( d1, d2 )
#define UC_DATA2( d1, d2 ) DATA2( d1, d2 )
#define S_DATA2( d1, d2 )  DATA2( d1, d2 )
#define US_DATA2( d1, d2 ) DATA2( d1, d2 )
#define L_DATA2( d1, d2 )  DATA2( d1, d2 )
#define UL_DATA2( d1, d2 ) DATA2( d1, d2 )
#define F_DATA2( d1, d2 )  DATA2( d1, d2 )

#define C_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define UC_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define S_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define US_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define L_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )
#define UL_DATA4( d1, d2, d3, d4 ) DATA4( d1, d2, d3, d4 )
#define F_DATA4( d1, d2, d3, d4 )  DATA4( d1, d2, d3, d4 )

#define C_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define UC_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define S_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define US_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define L_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define UL_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )
#define F_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( d1, d2, d3, d4, d5, d6, d7, d8 )

/*
 * macros for creating vmx permute masks (128-bits)
 */
#if defined( LITTLE_ENDIAN )
#define L_PERMUTE_MUNGE( l ) ( (l) ^ 0x1c1c1c1c )
#define S_PERMUTE_MUNGE( s ) ( (s) ^ 0x1e1e )
#define C_PERMUTE_MUNGE( c ) ( (c) ^ 0x1f )
#define L_INDEX_MUNGE( x )   ( (x) ^ 0x3 )
#define S_INDEX_MUNGE( x )   ( (x) ^ 0x7 )
#define C_INDEX_MUNGE( x )   ( (x) ^ 0xf )
#else
#define L_PERMUTE_MUNGE( l ) ( l )
#define S_PERMUTE_MUNGE( s ) ( s )
#define C_PERMUTE_MUNGE( c ) ( c )
#define L_INDEX_MUNGE( x )   ( x )
#define S_INDEX_MUNGE( x )   ( x )
#define C_INDEX_MUNGE( x )   ( x )
#endif

#define L_PERMUTE_MASK( l1, l2, l3, l4 ) \
    L_PERMUTE_MUNGE( l1 ), L_PERMUTE_MUNGE( l2 ), \
    L_PERMUTE_MUNGE( l3 ), L_PERMUTE_MUNGE( l4 ),

#define S_PERMUTE_MASK( s1, s2, s3, s4, s5, s6, s7, s8 ) \
    S_PERMUTE_MUNGE( s1 ), S_PERMUTE_MUNGE( s2 ), \
    S_PERMUTE_MUNGE( s3 ), S_PERMUTE_MUNGE( s4 ), \
    S_PERMUTE_MUNGE( s5 ), S_PERMUTE_MUNGE( s6 ), \
    S_PERMUTE_MUNGE( s7 ), S_PERMUTE_MUNGE( s8 ),

#define C_PERMUTE_MASK( c1, c2, c3, c4, c5, c6, c7, c8, \
                        c9, c10, c11, c12, c13, c14, c15, c16 ) \
    C_PERMUTE_MUNGE( c1 ),  C_PERMUTE_MUNGE( c2 ), \
    C_PERMUTE_MUNGE( c3 ),  C_PERMUTE_MUNGE( c4 ), \
    C_PERMUTE_MUNGE( c5 ),  C_PERMUTE_MUNGE( c6 ), \
    C_PERMUTE_MUNGE( c7 ),  C_PERMUTE_MUNGE( c8 ), \
    C_PERMUTE_MUNGE( c9 ),  C_PERMUTE_MUNGE( c10 ), \
    C_PERMUTE_MUNGE( c11 ), C_PERMUTE_MUNGE( c12 ), \
    C_PERMUTE_MUNGE( c13 ), C_PERMUTE_MUNGE( c14 ), \
    C_PERMUTE_MUNGE( c15 ), C_PERMUTE_MUNGE( c16 ),

/*
 * microcode entry point (e.g. vaddx, vaddx_)
 * "nop" for C code
 */
#define U_ENTRY( func_name )

/*
 * macros for C function prototypes
 */
#define C_PROTOTYPE_0( func_name ) \
    void func_name( void );
#define C_PROTOTYPE_1( func_name ) \
    void func_name( long );
#define C_PROTOTYPE_2( func_name ) \
    void func_name( long, long );
#define C_PROTOTYPE_3( func_name ) \
    void func_name( long, long, long );
#define C_PROTOTYPE_4( func_name ) \
    void func_name( long, long, long, long );
#define C_PROTOTYPE_5( func_name ) \
    void func_name( long, long, long, long, long );
#define C_PROTOTYPE_6( func_name ) \
    void func_name( long, long, long, long, long, long );
#define C_PROTOTYPE_7( func_name ) \
    void func_name( long, long, long, long, long, long, long );
#define C_PROTOTYPE_8( func_name ) \
    void func_name( long, long, long, long, long, long, long, long );
#define C_PROTOTYPE_9( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long );
#define C_PROTOTYPE_10( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long );
#define C_PROTOTYPE_11( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long );
#define C_PROTOTYPE_12( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long );
#define C_PROTOTYPE_13( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long, long );
#define C_PROTOTYPE_14( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long, long, long );
#define C_PROTOTYPE_15( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long, long, long, long );
#define C_PROTOTYPE_16( func_name ) \
    void func_name( long, long, long, long, long, long, long, long, \
                    long, long, long, long, long, long, long, long );

#define AUTO_r3_r31 \
    long r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r4_r31 \
    long r4, r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r5_r31 \
    long r5, r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r6_r31 \
    long r6, r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r7_r31 \
    long r7, r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r8_r31 \
    long r8, r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r9_r31 \
    long r9, r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r10_r31 \
    long r10, r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r11_r31 \
    long r11, r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r12_r31 \
    long r12, r13, r14, r15, r16, r17, \
         r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31;
#define AUTO_r13_r31 \
    long r13, r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;
#define AUTO_r14_r31 \
    long r14, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;
#define AUTO_r15_r31 \
    long r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;
#define AUTO_r16_r31 \
    long r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;
#define AUTO_r17_r31 \
    long r17, r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;
#define AUTO_r18_r31 \
    long r18, r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;
#define AUTO_r19_r31 \
    long r19, r20, r21, r22, r23, r24, r25, \
         r26, r27, r28, r29, r30, r31;

#define AUTO_f0_f31 \
    float f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12, f13, f14, \
          f15, f16, f17, f18, f19, f20, f21, f22, f23, f24, f25, f26, f27, \
          f28, f29, f30, f31;

#define AUTO_d0_d31 \
    double d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, \
           d15, d16, d17, d18, d19, d20, d21, d22, d23, d24, d25, d26, d27, \
           d28, d29, d30, d31;

#if defined( BUILD MAX ) 
#define AUTO v0_v31 \ 

VMX_reg vO , vl , v2, v3, v4 , v5, v6 , v7 , v8 , v9 , vlO, vll, vl2, vl3, vl4 , 

vl5, vl6, vl7, vl8, vl9, v20, v21, v22, v23, v24, v25, v26, v27, \ 
v28, v29, v30, v31; 

#endif 

^ * For C implementation, create a dummy stack on function entry of size 
4096. 
*/ 

#define STACK_SIZE 4 096 

// * macros for C and Fortran callable entry points 

#def ine ENTRY 0 ( func name ) \ 
C PROTOTYPE 0{ func name ) \ 
void func_name ( void ) \ 

long CR[8]; ulong CTR; ulong VSCR; long rO; \ 
AUTO r3 r31 \ 
AUTO fO f31 \ 
AUTO dO d31 \ 
AUTO_v0 v31 \ 

long gpr save area [ 19 + 4 ] ,- \ 
long fpr save area [ 2*18 + 4 ] ; \ 
long vr save area [ 4*12 + 4 ] ,- \ 
long stack [STACK_SIZE + 4], sp,- 

#define ENTRY 1( func name, argO } \ 
C PROTOTYPE 1 ( func name ) \ 
void func_name ( long argO } \ 

long CR[8]; ulong CTR; ulong VSCR; long rO ; \ 

AUTO r4 r31 \ 

AUTO fO f31 \ 

AUTO dO d31 \ 
AUTO_v0 v31 \ 

long gpr save area [ 19 + 4 ] ; \ 

long fpr save area [ 2*18 + 4 ]; \ 

long vr save area [ 4*12 + 4 ] ; \ 

long stack[STACK_SIZE + 4], sp; 
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#define ENTRY 2( func name, argO, argl ) \ 
C PROTOTYPE 2 ( func name ) \ 
void func name ( long argO, long argl ) \ 
{ \ 

long CR[8]; ulong CTR; ulong VSCR; long rO ; \ 
AUTO r5 r31 \ 
AUTO fO f31 \ 
AUTO dO d31 \ 
AUTO_v0 v31 \ 

long gpr save area [ 19 + 4 ] ; \ 
long fpr save area [ 2*18 + 4 ] ; \ 
long vr save area [ 4*12 + 4 ] ; \ 
long stack [STACK_SIZE + 4], sp; 

#define ENTRY_3( func_name, arg0, arg1, arg2 ) \
C_PROTOTYPE_3( func_name ) \
void func_name ( long arg0, long arg1, long arg2 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r6_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_4( func_name, arg0, arg1, arg2, arg3 ) \
C_PROTOTYPE_4( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r7_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
C_PROTOTYPE_5( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r8_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
C_PROTOTYPE_6( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r9_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6 ) \
C_PROTOTYPE_7( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r10_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7 ) \
C_PROTOTYPE_8( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r11_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7, arg8 ) \
C_PROTOTYPE_9( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7, long arg8 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r12_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7, arg8, arg9 ) \
C_PROTOTYPE_10( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7, long arg8, long arg9 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r13_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7, arg8, arg9, arg10 ) \
C_PROTOTYPE_11( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7, long arg8, long arg9, \
long arg10 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r14_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7, arg8, arg9, arg10, arg11 ) \
C_PROTOTYPE_12( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7, long arg8, long arg9, \
long arg10, long arg11 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r15_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7, arg8, arg9, arg10, arg11, \
arg12 ) \
C_PROTOTYPE_13( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7, long arg8, long arg9, \
long arg10, long arg11, long arg12 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r16_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7, arg8, arg9, arg10, arg11, \
arg12, arg13 ) \
C_PROTOTYPE_14( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7, long arg8, long arg9, \
long arg10, long arg11, long arg12, long arg13 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r17_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7, arg8, arg9, arg10, arg11, \
arg12, arg13, arg14 ) \
C_PROTOTYPE_15( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7, long arg8, long arg9, \
long arg10, long arg11, long arg12, long arg13, \
long arg14 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r18_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;

#define ENTRY_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
arg6, arg7, arg8, arg9, arg10, arg11, \
arg12, arg13, arg14, arg15 ) \
C_PROTOTYPE_16( func_name ) \
void func_name ( long arg0, long arg1, long arg2, long arg3, long arg4, \
long arg5, long arg6, long arg7, long arg8, long arg9, \
long arg10, long arg11, long arg12, long arg13, \
long arg14, long arg15 ) \
{ \
long CR[8]; ulong CTR; ulong VSCR; long r0; \
AUTO_r19_r31 \
AUTO_f0_f31 \
AUTO_d0_d31 \
AUTO_v0_v31 \
long gpr_save_area[ 19 + 4 ]; \
long fpr_save_area[ 2*18 + 4 ]; \
long vr_save_area[ 4*12 + 4 ]; \
long stack[STACK_SIZE + 4], sp;
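
The entry-point macros above open a function body and declare the emulated register state, leaving the caller of the macro to write the body and the closing brace. The following is only a minimal sketch of that pattern: SKETCH_ENTRY_2 and sketch_add are hypothetical names, the register set is cut down to two locals, and the function returns a value (unlike the void entry points in the listing) so that the result can be checked.

```c
/* Hypothetical, simplified stand-in for an ENTRY_2-style macro: it opens
 * the function and declares a tiny emulated register file. The closing
 * brace is supplied by the code that follows the macro invocation. */
#define SKETCH_ENTRY_2( func_name, arg0, arg1 ) \
long func_name ( long arg0, long arg1 ) \
{ \
    long CR[8] = {0}; \
    long r5 = 0, r6 = 0; \
    (void)CR;

/* Body written in the "assembly expressed as C" style. */
SKETCH_ENTRY_2( sketch_add, r3, r4 )
    r5 = r3 + r4;       /* plays the role of ADD( r5, r3, r4 ) */
    r6 = r5 << 1;       /* plays the role of SLWI( r6, r5, 1 ) */
    return r6;
}
```

Because the arguments are named after registers (r3, r4), the body reads like the assembly it mirrors while remaining ordinary C.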

/*
 * macros to get GPR arguments beyond 8
 */

#define GET_ARG8( rD )
#define GET_ARG9( rD )
#define GET_ARG10( rD )
#define GET_ARG11( rD )
#define GET_ARG12( rD )
#define GET_ARG13( rD )
#define GET_ARG14( rD )
#define GET_ARG15( rD )
#define GET_ARG16( rD )
#define GET_ARG17( rD )

/*
 * macros to set GPR arguments beyond 8
 */

#define SET_ARG8( rD )
#define SET_ARG9( rD )
#define SET_ARG10( rD )
#define SET_ARG11( rD )
#define SET_ARG12( rD )
#define SET_ARG13( rD )
#define SET_ARG14( rD )
#define SET_ARG15( rD )

/*
 * macro to branch from one entry point to another
 */

#define BR_FUNC( func_name ) \
func_name ( );

/*
 * macros to call functions
 */

#define CALL_FUNC( func_name ) \
func_name ( );

/*
 * macros to call functions
 */

#define CALL_0( func_name ) \
func_name ( );

#define CALL_1( func_name, arg0 ) \
func_name ( arg0 );

#define CALL_2( func_name, arg0, arg1 ) \
func_name ( arg0, arg1 );

#define CALL_3( func_name, arg0, arg1, arg2 ) \
func_name ( arg0, arg1, arg2 );

#define CALL_4( func_name, arg0, arg1, arg2, arg3 ) \
func_name ( arg0, arg1, arg2, arg3 );

#define CALL_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
func_name ( arg0, arg1, arg2, arg3, arg4 );

#define CALL_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5 );

#define CALL_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6 );

#define CALL_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 );

#define CALL_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8 );

#define CALL_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9 );

#define CALL_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10 );

#define CALL_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11 );

#define CALL_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11, arg12 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11, arg12 );

#define CALL_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11, arg12, arg13 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11, arg12, arg13 );

#define CALL_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11, arg12, arg13, arg14 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11, arg12, arg13, arg14 );

#define CALL_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 ) \
func_name ( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 );

#if defined( BUILD_MAX )

/*
 * G4 macros to create a dummy jump table.
 * (not supported in C)
 */

#define DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_V5( root_name )
#define DECLARE_VMX_Z1( root_name )
#define DECLARE_VMX_Z2( root_name )
#define DECLARE_VMX_Z3( root_name )
#define DECLARE_VMX_Z4( root_name )
#define DECLARE_VMX_Z5( root_name )

/*
 * G4 macros to decide whether to enter a VMX loop
 * VMX loop is entered if at least minimum count,
 * all vectors have the same relative alignment
 * (i.e., same lower 4 bits) and all strides are unit.
 * Note, a unit_s_imm argument is provided because some
 * packed interleaved complex functions (stride 2) such
 * as cvaddx() can be implemented with a VMX loop.
 * Only one macro should be invoked per source file.
 * (not supported in C)
 */

#define BR_IF_VMX_V1( root_name, min_n_imm, unit_s_imm, p1, s1, n, eflag )
#define BR_IF_VMX_V1_ALIGNED( root_name, min_n_imm, unit_s_imm, \
p1, s1, n, eflag )
#define BR_IF_VMX_V2( root_name, min_n_imm, unit_s_imm, \
p1, s1, p2, s2, n, eflag )
#define BR_IF_VMX_V2_LS( root_name, min_n_imm, unit_s_imm, \
p1, s1, ps, s2, n, eflag )
#define BR_IF_VMX_V2_LC( root_name, min_n_imm, unit_s_imm, \
p1, s1, pc, n, eflag )
#define BR_IF_VMX_V2_ALIGNED( root_name, min_n_imm, unit_s_imm, \
p1, s1, p2, s2, n, eflag )
#define BR_IF_VMX_V3( root_name, min_n_imm, unit_s_imm, \
p1, s1, p2, s2, p3, s3, n, eflag )
#define BR_IF_VMX_V3_ALIGNED( root_name, min_n_imm, unit_s_imm, \
p1, s1, p2, s2, p3, s3, n, eflag )
#define BR_IF_VMX_V4( root_name, min_n_imm, unit_s_imm, \
p1, s1, p2, s2, p3, s3, p4, s4, n, eflag )
#define BR_IF_VMX_V4_ALIGNED( root_name, min_n_imm, unit_s_imm, \
p1, s1, p2, s2, p3, s3, p4, s4, n, eflag )
#define BR_IF_VMX_V5( root_name, min_n_imm, unit_s_imm, \
p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag )
#define BR_IF_VMX_V5_ALIGNED( root_name, min_n_imm, unit_s_imm, \
p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, \
eflag )

#define BR_IF_VMX_Z1( root_name, min_n_imm, unit_s_imm, \
pr1, pi1, s1, n, eflag )
#define BR_IF_VMX_Z2( root_name, min_n_imm, unit_s_imm, \
pr1, pi1, s1, pr2, pi2, s2, n, eflag )
#define BR_IF_VMX_Z3( root_name, min_n_imm, unit_s_imm, \
pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, n, eflag )
#define BR_IF_VMX_Z4( root_name, min_n_imm, unit_s_imm, \
pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
pr4, pi4, s4, n, eflag )
#define BR_IF_VMX_Z5( root_name, min_n_imm, unit_s_imm, \
pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
pr4, pi4, s4, pr5, pi5, s5, n, eflag )
#define BR_IF_VMX_CONV( root_name, min_n_imm, \
p1, s1, s2, p3, s3, n, eflag )
#define BR_IF_VMX_ZCONV( root_name, min_n_imm, \
pr1, pi1, s1, s2, pr3, pi3, s3, n, eflag )

/*
 * G4 macro to get VMX unaligned (FP) count
 * assumes all vectors have the same relative alignment
 * and that the last 2 bits of ptr are 0
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT( count, ptr ) \
{ \
(count) = -(ptr); \
(count) = ((count) >> 2) & 3; \
CR[0] = (long)(count); \
}

/*
 * G4 macro to get VMX unaligned short count
 * assumes that the last bit of ptr is 0
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT_S( count, ptr ) \
{ \
(count) = -(ptr); \
(count) = ((count) >> 1) & 7; \
CR[0] = (long)(count); \
}

/*
 * G4 macro to get VMX unaligned char count
 * sets condition code CR0
 */

#define GET_VMX_UNALIGNED_COUNT_C( count, ptr ) \
{ \
(count) = -(ptr); \
(count) = (count) & 15; \
CR[0] = (long)(count); \
}

/*
 * G4 macro to load and splat an FP scalar independent of alignment
 */

#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
(vt).f[0] = (vt).f[1] = (vt).f[2] = (vt).f[3] = *scalarp;

#endif /* end BUILD_MAX */
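
The unaligned-count macros above negate the pointer and keep the low bits, which gives the number of elements that must be handled before the address reaches a 16-byte (vector) boundary. The sketch below restates that arithmetic as plain functions; the function names are hypothetical, and unsigned arithmetic on uintptr_t is an assumption made here so that the negation is well defined.

```c
#include <stdint.h>

/* 4-byte (FP) elements before ptr reaches 16-byte alignment.
 * Assumes the low 2 bits of ptr are 0, as the listing's comment states. */
static unsigned long unaligned_count_fp(uintptr_t ptr) {
    return (unsigned long)((((uintptr_t)0 - ptr) >> 2) & 3);
}

/* 2-byte (short) elements before alignment; assumes the low bit of ptr is 0. */
static unsigned long unaligned_count_short(uintptr_t ptr) {
    return (unsigned long)((((uintptr_t)0 - ptr) >> 1) & 7);
}

/* 1-byte (char) elements before alignment. */
static unsigned long unaligned_count_char(uintptr_t ptr) {
    return (unsigned long)(((uintptr_t)0 - ptr) & 15);
}
```

For an address ending in 0x4, twelve bytes remain before the next 16-byte boundary, so the FP count is 3; an already aligned address yields 0.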

/*
 * cache (DCBT and DCBZ) macros.
 */

#define DCBT_TRUE( cond_bit, scratch ) \
CR[(cond_bit)] = -1; /* true (<= 0) */

#define DCBT_FALSE( cond_bit, scratch ) \
CR[(cond_bit)] = 1; /* false (> 0) */

#define DCBZ_FALSE( cond_bit, scratch ) \
DCBT_FALSE( cond_bit, scratch )

#define SET_DCBT_COND( cond_bit, cache_bit, eflag, scratch1 ) \
CR[(cond_bit)] = (eflag & (cache_bit));

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
unit_stride, count, tmp1, tmp2, tmp3 ) \
CR[(cond_bit)] = (eflag & (cache_bit));

#define DCBT_IF( cond_bit, rA, rB ) \
if ( CR[(cond_bit)] <= 0 ) \
{ DCBT( rA, rB ) }

#define DCBZ_IF( cond_bit, rA, rB ) \
if ( CR[(cond_bit)] <= 0 ) \
{ DCBZ( rA, rB ) }

#define DCBT_IF_CACHABLE( cond_bit, rA, rB ) \
DCBT_IF( cond_bit, rA, rB )

#define DCBZ_IF_CACHABLE( cond_bit, rA, rB ) \
DCBZ_IF( cond_bit, rA, rB )

#define BR_IF_CACHABLE( cond_bit, label ) \
if ( CR[(cond_bit)] <= 0 ) \
goto label;

#define BR_IF_NOT_CACHABLE( cond_bit, label ) \
if ( CR[(cond_bit)] > 0 ) \
goto label;
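
In the cache macros above, a condition field counts as true when its stored value is less than or equal to zero, and SET_DCBT_COND stores the masked flag bit directly. A minimal sketch of that convention follows; the array name CR matches the listing, while dcbt_cond_true and its parameters are hypothetical names introduced only for illustration.

```c
/* Emulated condition register fields; each holds a signed value. */
static long CR[8];

/* Stores (eflag & cache_bit) in the chosen field, as SET_DCBT_COND does,
 * then applies the same "<= 0 means true" test used by DCBT_IF and
 * BR_IF_CACHABLE. Returns 1 when the condition is true. */
static int dcbt_cond_true(int cond_bit, long eflag, long cache_bit) {
    CR[cond_bit] = eflag & cache_bit;
    return CR[cond_bit] <= 0;
}
```

Under this convention a set flag bit produces a positive value, so the condition is false and the cache operation is skipped; a clear bit produces zero and the operation is performed.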

/*
 * ASIC macros
 */

#if defined( COMPILE_PREFETCH )

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 ) \
*(volatile long *)PREFETCH_CONTROL = (mode);

#define LOAD_MISCON_B( mode, scratch1, scratch2 ) \
*(volatile long *)MISCON_B = (mode);

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 ) \
{ \
volatile long i; \
i = *(volatile long *)MISCON_B; \
i &= PREFETCH_MASK; \
i |= USE_PREFETCH_CONTROL; \
*(volatile long *)PREFETCH_CONTROL = i; \
}

#else

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 )
#define LOAD_MISCON_B( mode, scratch1, scratch2 )
#define RESET_PREFETCH_CONTROL( scratch1, scratch2 )

#endif
/*
 * instruction macros
 */

#define ADD( rD, rA, rB )  (rD) = (rA) + (rB);
#define ADD_C( rD, rA, rB )  (rD) = (rA) + (rB); CR[0] = (long)(rD);
#define ADDI( rD, rA, SIMM )  (rD) = (rA) + (SIMM);
#define ADDIC_C( rD, rA, SIMM )  (rD) = (rA) + (SIMM); CR[0] = (long)(rD);
#define ADDIS( rD, rA, SIMM )  (rD) = (rA) + ((SIMM) << 16);
#define AND( rA, rS, rB )  (rA) = (rS) & (rB);
#define AND_C( rA, rS, rB )  (rA) = (rS) & (rB); CR[0] = (long)(rA);
#define ANDC( rA, rS, rB )  (rA) = (rS) & ~(rB);
#define ANDC_C( rA, rS, rB )  (rA) = (rS) & ~(rB); CR[0] = (long)(rA);
#define ANDI_C( rA, rS, UIMM )  (rA) = (rS) & (UIMM); CR[0] = (long)(rA);
#define ANDIS_C( rA, rS, UIMM )  (rA) = (rS) & ((UIMM) << 16); \
CR[0] = (long)(rA);
#define BA( addr )  goto (addr);
#define BCTR  (*(void (*)(void))CTR)();
#define BEQ( label )  if ( CR[0] == 0 ) goto label;
#define BEQ_PLUS( label )  BEQ( label )
#define BEQ_MINUS( label )  BEQ( label )
#define BEQ_CR( bit, label )  if ( CR[(bit)] == 0 ) goto label;
#define BEQ_CR_PLUS( bit, label )  BEQ_CR( bit, label )
#define BEQ_CR_MINUS( bit, label )  BEQ_CR( bit, label )
#define BEQLR  if ( CR[0] == 0 ) return;
#define BEQLR_PLUS  BEQLR
#define BEQLR_MINUS  BEQLR
#define BEQLR_CR( bit )  if ( CR[(bit)] == 0 ) return;
#define BEQLR_CR_PLUS( bit )  BEQLR_CR( bit )
#define BEQLR_CR_MINUS( bit )  BEQLR_CR( bit )
#define BGE( label )  if ( CR[0] >= 0 ) goto label;
#define BGE_PLUS( label )  BGE( label )
#define BGE_MINUS( label )  BGE( label )
#define BGE_CR( bit, label )  if ( CR[(bit)] >= 0 ) goto label;
#define BGE_CR_PLUS( bit, label )  BGE_CR( bit, label )
#define BGE_CR_MINUS( bit, label )  BGE_CR( bit, label )
#define BGELR  if ( CR[0] >= 0 ) return;
#define BGELR_PLUS  BGELR
#define BGELR_MINUS  BGELR
#define BGELR_CR( bit )  if ( CR[(bit)] >= 0 ) return;
#define BGELR_CR_PLUS( bit )  BGELR_CR( bit )
#define BGELR_CR_MINUS( bit )  BGELR_CR( bit )
#define BGT( label )  if ( CR[0] > 0 ) goto label;
#define BGT_PLUS( label )  BGT( label )
#define BGT_MINUS( label )  BGT( label )
#define BGT_CR( bit, label )  if ( CR[(bit)] > 0 ) goto label;
#define BGT_CR_PLUS( bit, label )  BGT_CR( bit, label )
#define BGT_CR_MINUS( bit, label )  BGT_CR( bit, label )
#define BGTLR  if ( CR[0] > 0 ) return;
#define BGTLR_PLUS  BGTLR
#define BGTLR_MINUS  BGTLR
#define BGTLR_CR( bit )  if ( CR[(bit)] > 0 ) return;
#define BGTLR_CR_PLUS( bit )  BGTLR_CR( bit )
#define BGTLR_CR_MINUS( bit )  BGTLR_CR( bit )
#define BL( func_name )  func_name ( );
#define BLE( label )  if ( CR[0] <= 0 ) goto label;
#define BLE_PLUS( label )  BLE( label )
#define BLE_MINUS( label )  BLE( label )
#define BLE_CR( bit, label )  if ( CR[(bit)] <= 0 ) goto label;
#define BLE_CR_PLUS( bit, label )  BLE_CR( bit, label )
#define BLE_CR_MINUS( bit, label )  BLE_CR( bit, label )
#define BLELR  if ( CR[0] <= 0 ) return;
#define BLELR_PLUS  BLELR
#define BLELR_MINUS  BLELR
#define BLELR_CR( bit )  if ( CR[(bit)] <= 0 ) return;
#define BLELR_CR_PLUS( bit )  BLELR_CR( bit )
#define BLELR_CR_MINUS( bit )  BLELR_CR( bit )
#define BLR  return;
#define BLT( label )  if ( CR[0] < 0 ) goto label;
#define BLT_PLUS( label )  BLT( label )
#define BLT_MINUS( label )  BLT( label )
#define BLT_CR( bit, label )  if ( CR[(bit)] < 0 ) goto label;
#define BLT_CR_PLUS( bit, label )  BLT_CR( bit, label )
#define BLT_CR_MINUS( bit, label )  BLT_CR( bit, label )
#define BLTLR  if ( CR[0] < 0 ) return;
#define BLTLR_PLUS  BLTLR
#define BLTLR_MINUS  BLTLR
#define BLTLR_CR( bit )  if ( CR[(bit)] < 0 ) return;
#define BLTLR_CR_PLUS( bit )  BLTLR_CR( bit )
#define BLTLR_CR_MINUS( bit )  BLTLR_CR( bit )
#define BNE( label )  if ( CR[0] != 0 ) goto label;
#define BNE_PLUS( label )  BNE( label )
#define BNE_MINUS( label )  BNE( label )
#define BNE_CR( bit, label )  if ( CR[(bit)] != 0 ) goto label;
#define BNE_CR_PLUS( bit, label )  BNE_CR( bit, label )
#define BNE_CR_MINUS( bit, label )  BNE_CR( bit, label )
#define BNELR  if ( CR[0] != 0 ) return;
#define BNELR_PLUS  BNELR
#define BNELR_MINUS  BNELR
#define BNELR_CR( bit )  if ( CR[(bit)] != 0 ) return;
#define BNELR_CR_PLUS( bit )  BNELR_CR( bit )
#define BNELR_CR_MINUS( bit )  BNELR_CR( bit )
#define BR( label )  goto label;
#define CLRLWI( rA, rS, nbits )  (rA) = (rS) & ((1 << (32-nbits)) - 1);
#define CLRLWI_C( rA, rS, nbits )  (rA) = (rS) & ((1 << (32-nbits)) - 1); \
CR[0] = (long)(rA);
#define CLRRWI( rA, rS, nbits )  (rA) = (rS) & ~((1 << nbits) - 1);
#define CLRRWI_C( rA, rS, nbits )  (rA) = (rS) & ~((1 << nbits) - 1); \
CR[0] = (long)(rA);
#define CMPLW( rA, rB )  CR[0] = (((rA)^(rB)) & (1 << 31)) ? \
((rB) - (rA)) : ((rA) - (rB));
#define CMPLW_CR( bit, rA, rB )  CR[(bit)] = (((rA)^(rB)) & (1 << 31)) ? \
((rB) - (rA)) : ((rA) - (rB));
#define CMPLWI( rA, UIMM )  CR[0] = (((rA)^(UIMM)) & (1 << 31)) ? \
((UIMM) - (rA)) : ((rA) - (UIMM));
#define CMPLWI_CR( bit, rA, UIMM )  CR[(bit)] = (((rA)^(UIMM)) & (1 << 31)) ? \
((UIMM) - (rA)) : ((rA) - (UIMM));
#define CMPW( rA, rB )  CR[0] = (rA) - (rB);
#define CMPW_CR( bit, rA, rB )  CR[(bit)] = (rA) - (rB);
#define CMPWI( rA, SIMM )  CR[0] = (rA) - (SIMM);
#define CMPWI_CR( bit, rA, SIMM )  CR[(bit)] = (rA) - (SIMM);
#define DCBF( rA, rB )
#define DCBI( rA, rB )
#define DCBST( rA, rB )
#define DCBT( rA, rB )
#define DCBTST( rA, rB )
#define DCBZ( rA, rB ) \
*(long *)(((rA)+(rB)) & ~CACHE_LINE_MASK) = 0; \
*(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+4) = 0; \
*(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+8) = 0; \
*(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+12) = 0; \
*(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+16) = 0; \
*(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+20) = 0; \
*(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+24) = 0; \
*(long *)((((rA)+(rB)) & ~CACHE_LINE_MASK)+28) = 0;
#define DECR( rD )  --(rD);
#define DECR_C( rD )  --(rD); CR[0] = (long)(rD);
#define DIVW( rD, rA, rB )  (rD) = (rA) / (rB);
#define DIVW_C( rD, rA, rB )  (rD) = (rA) / (rB); CR[0] = (long)(rD);
#define DIVWU( rD, rA, rB )  (rD) = (ulong)(rA) / (ulong)(rB);
#define DIVWU_C( rD, rA, rB )  (rD) = (ulong)(rA) / (ulong)(rB); \
CR[0] = (long)(rD);
#define EQV( rA, rS, rB )  (rA) = ~((rS) ^ (rB));
#define EQV_C( rA, rS, rB )  (rA) = ~((rS) ^ (rB)); \
CR[0] = (long)(rA);
#define FABS( frD, frB )  (frD) = ((frB) >= 0.0) ? (frB) : -(frB);
#define FADD( frD, frA, frB )  (frD) = (frA) + (frB);
#define FADDS( frD, frA, frB )  (frD) = (frA) + (frB);
#define FCMPO( bit, frA, frB ) \
{ \
if ( (frA) < (frB) ) CR[(bit)] = -1; \
else if ( (frA) > (frB) ) CR[(bit)] = 1; \
else CR[(bit)] = 0; \
}
#define FCMPU( bit, frA, frB )  FCMPO( bit, frA, frB )
#define FCTIW( frD, frB )
#define FCTIWZ( frD, frB ) \
{ \
union { \
long i[2]; \
double d; \
} u; \
u.i[0] = (long)(frB); \
u.i[1] = 0; \
(frD) = u.d; \
}
#define FDIV( frD, frA, frB )  (frD) = (frA) / (frB);
#define FDIVS( frD, frA, frB )  (frD) = (frA) / (frB);
#define FMADD( frD, frA, frC, frB )  (frD) = (frA) * (frC) + (frB);
#define FMADDS( frD, frA, frC, frB )  (frD) = (frA) * (frC) + (frB);
#define FMOV( frD, frB )  (frD) = (frB);
#define FMR( frD, frB )  (frD) = (frB);
#define FMUL( frD, frA, frB )  (frD) = (frA) * (frB);
#define FMULS( frD, frA, frB )  (frD) = (frA) * (frB);
#define FMSUB( frD, frA, frC, frB )  (frD) = (frA) * (frC) - (frB);
#define FMSUBS( frD, frA, frC, frB )  (frD) = (frA) * (frC) - (frB);
#define FNABS( frD, frB )  (frD) = ((frB) >= 0.0) ? -(frB) : (frB);
#define FNEG( frD, frB )  (frD) = -(frB);
#define FNMADD( frD, frA, frC, frB )  (frD) = -((frA) * (frC) + (frB));
#define FNMADDS( frD, frA, frC, frB )  (frD) = -((frA) * (frC) + (frB));
#define FNMSUB( frD, frA, frC, frB )  (frD) = -((frA) * (frC) - (frB));
#define FNMSUBS( frD, frA, frC, frB )  (frD) = -((frA) * (frC) - (frB));
#define FRES( frD, frB )
#define FRSP( frD, frB )  (frD) = (float)(frB);
#define FRSQRTE( frD, frB )
#define FSEL( frD, frA, frC, frB )  (frD) = ((frA) >= 0.0) ? (frC) : (frB);
#define FSUB( frD, frA, frB )  (frD) = (frA) - (frB);
#define FSUBS( frD, frA, frB )  (frD) = (frA) - (frB);
#define GOTO( label )  BR( label )
#define INCR( rD )  ++(rD);
#define INCR_C( rD )  ++(rD); CR[0] = (long)(rD);
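
The compare macros in this listing (CMPW, CMPLW and their variants) store a value in CR[] whose sign encodes the outcome, and the branch macros then test that sign. The sketch below illustrates the convention with hypothetical helper functions. It uses explicit comparisons rather than the subtraction used by the macros, which is an assumption made here so that the example cannot overflow.

```c
#include <stdint.h>

/* Sign-encoded compare results, mirroring the CR[] convention:
 * negative = less than, zero = equal, positive = greater than. */
static long cmpw_sketch(int32_t a, int32_t b) {
    return (long)(a > b) - (long)(a < b);      /* signed compare */
}

static long cmplw_sketch(uint32_t a, uint32_t b) {
    return (long)(a > b) - (long)(a < b);      /* unsigned (logical) compare */
}

/* A BLT-style test: the branch is taken when the stored value is negative. */
static int blt_taken(long cr_field) {
    return cr_field < 0;
}
```

The same bit pattern can compare differently under the two rules: 0xFFFFFFFF is -1 when treated as signed but the largest value when treated as unsigned, which is why the listing keeps separate signed and logical compare macros.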

#define LA( rD, symbol, SIMM )  (rD) = (long)&(symbol);
#define LABEL( label )  label:
#define LBZ( rD, rA, d )  (rD) = *(uchar *)((rA) + (d));
#define LBZA( rD, symbol )  (rD) = *(uchar *)&(symbol);
#define LBZU( rD, rA, d )  (rD) = *(uchar *)((rA) += (d));
#define LBZUX( rD, rA, rB )  (rD) = *(uchar *)((rA) += (rB));
#define LBZX( rD, rA, rB )  (rD) = *(uchar *)((rA) + (rB));
#define LFD( frD, rA, d )  (frD) = *(double *)((rA) + (d));
#define LFDU( frD, rA, d )  (frD) = *(double *)((rA) += (d));
#define LFDUX( frD, rA, rB )  (frD) = *(double *)((rA) += (rB));
#define LFDX( frD, rA, rB )  (frD) = *(double *)((rA) + (rB));
#define LFS( frD, rA, d )  (frD) = *(float *)((rA) + (d));
#define LFSA( frD, symbol, rT )  (frD) = *(float *)&(symbol);
#define LFSU( frD, rA, d )  (frD) = *(float *)((rA) += (d));
#define LFSUX( frD, rA, rB )  (frD) = *(float *)((rA) += (rB));
#define LFSX( frD, rA, rB )  (frD) = *(float *)((rA) + (rB));
#define LHA( rD, rA, d )  (rD) = *(short *)((rA) + (d));
#define LHAA( rD, symbol )  (rD) = *(short *)&(symbol);
#define LHAU( rD, rA, d )  (rD) = *(short *)((rA) += (d));
#define LHAUX( rD, rA, rB )  (rD) = *(short *)((rA) += (rB));
#define LHAX( rD, rA, rB )  (rD) = *(short *)((rA) + (rB));
#define LHZ( rD, rA, d )  (rD) = *(ushort *)((rA) + (d));
#define LHZA( rD, symbol )  (rD) = *(ushort *)&(symbol);
#define LHZU( rD, rA, d )  (rD) = *(ushort *)((rA) += (d));
#define LHZUX( rD, rA, rB )  (rD) = *(ushort *)((rA) += (rB));
#define LHZX( rD, rA, rB )  (rD) = *(ushort *)((rA) + (rB));
#define LI( rD, SIMM )  (rD) = (SIMM);
#define LIS( rD, SIMM )  (rD) = ((SIMM) << 16);
#define LOAD_COUNT( rD )  CTR = (rD);
#define LWZ( rD, rA, d )  (rD) = *(long *)((rA) + (d));
#define LWZA( rD, symbol )  (rD) = *(long *)&(symbol);
#define LWZU( rD, rA, d )  (rD) = *(long *)((rA) += (d));
#define LWZUX( rD, rA, rB )  (rD) = *(long *)((rA) += (rB));
#define LWZX( rD, rA, rB )  (rD) = *(long *)((rA) + (rB));
#define MCRF( crfD, crfS )
#define MCRFS( crfD, crfS )
#define MFCR( rD )
#define MFCTR( rD )
#define MFLR( rD )
#define MFSPR( rD, SPR )
#define MOV( rA, rS )  (rA) = (rS);
#define MOV_C( rA, rS )  (rA) = (rS); CR[0] = (long)(rA);
#define MR( rA, rS )  (rA) = (rS);
#define MR_C( rA, rS )  (rA) = (rS); CR[0] = (long)(rA);
#define MTCR( rD )
#define MTCTR( rD )
#define MTFSFI( crfD, IMM )
#define MTLR( rD )
#define MTSPR( SPR, rS )
#define MULLI( rD, rA, SIMM )  (rD) = (rA) * (SIMM);
#define MULLW( rD, rA, rB )  (rD) = (rA) * (rB);
#define MULLW_C( rD, rA, rB )  (rD) = (rA) * (rB); CR[0] = (long)(rD);
#define NAND( rA, rS, rB )  (rA) = ~((rS) & (rB));
#define NAND_C( rA, rS, rB )  (rA) = ~((rS) & (rB)); CR[0] = (long)(rA);
#define NEG( rD, rA )  (rD) = -(rA);
#define NEG_C( rD, rA )  (rD) = -(rA); CR[0] = (long)(rD);
#define NOP
#define NOR( rA, rS, rB )  (rA) = ~((rS) | (rB));
#define NOR_C( rA, rS, rB )  (rA) = ~((rS) | (rB)); CR[0] = (long)(rA);
#define OR( rA, rS, rB )  (rA) = (rS) | (rB);
#define OR_C( rA, rS, rB )  (rA) = (rS) | (rB); CR[0] = (long)(rA);
#define ORC( rA, rS, rB )  (rA) = (rS) | ~(rB);
#define ORC_C( rA, rS, rB )  (rA) = (rS) | ~(rB); CR[0] = (long)(rA);
#define ORI( rA, rS, UIMM )  (rA) = (rS) | (UIMM);
#define ORIS( rA, rS, UIMM )  (rA) = (rS) | ((UIMM) << 16);

#define RETURN  BLR

#define RLWIMI( rA, rS, SH, MB, ME ) \
{ \
ulong mask; \
mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
(rA) &= ~mask; \
(rA) |= ((((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask); \
}

#define RLWIMI_C( rA, rS, SH, MB, ME ) \
{ \
ulong mask; \
mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
(rA) &= ~mask; \
(rA) |= ((((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask); \
CR[0] = (long)(rA); \
}

#define RLWINM( rA, rS, SH, MB, ME ) \
{ \
ulong mask; \
mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
(rA) = (((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask; \
}

#define RLWINM_C( rA, rS, SH, MB, ME ) \
{ \
ulong mask; \
mask = ((1 << ((ME) - (MB) + 1)) - 1) << (31 - (ME)); \
(rA) = (((rS) << (SH)) | ((ulong)(rS) >> (32 - (SH)))) & mask; \
CR[0] = (long)(rA); \
}
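
The rotate-and-mask macros above build a mask covering bits MB through ME (bit 0 being the most significant), rotate the source left by SH, and keep only the masked bits. The following sketch expresses the same operation on fixed-width uint32_t values. The function names are hypothetical, and the explicit guards for a zero rotate and a full 32-bit mask are assumptions added here, since shifting a 32-bit value by 32 is undefined in C.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by sh bits (sh taken modulo 32). */
static uint32_t rotl32(uint32_t x, unsigned sh) {
    sh &= 31u;
    return sh ? (x << sh) | (x >> (32u - sh)) : x;
}

/* Mask with ones in bit positions mb..me, bit 0 = most significant.
 * Assumes mb <= me <= 31 (no wrap-around masks in this sketch). */
static uint32_t mask_mb_me(unsigned mb, unsigned me) {
    unsigned width = me - mb + 1u;
    uint32_t ones = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return ones << (31u - me);
}

/* Rotate left, then AND with the mask. */
static uint32_t rlwinm_sketch(uint32_t rs, unsigned sh, unsigned mb, unsigned me) {
    return rotl32(rs, sh) & mask_mb_me(mb, me);
}
```

With this helper, a plain rotate is rlwinm_sketch(x, n, 0, 31), and extracting the top byte right-justified corresponds to the EXTRWI form with n = 8 and b = 0, that is rlwinm_sketch(x, 8, 24, 31).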

#define RLWNM( rA, rS, rB, MB, ME )  RLWINM( rA, rS, (rB) & 0x1f, MB, ME )
#define RLWNM_C( rA, rS, rB, MB, ME )  RLWINM_C( rA, rS, (rB) & 0x1f, MB, ME )

#define EXTLWI( rA, rS, n, b )  RLWINM( rA, rS, (b), 0, (n)-1 )
#define EXTLWI_C( rA, rS, n, b )  RLWINM_C( rA, rS, (b), 0, (n)-1 )
#define EXTRWI( rA, rS, n, b )  RLWINM( rA, rS, (b)+(n), 32-(n), 31 )
#define EXTRWI_C( rA, rS, n, b )  RLWINM_C( rA, rS, (b)+(n), 32-(n), 31 )
#define INSLWI( rA, rS, n, b )  RLWIMI( rA, rS, 32-(b), (b), (b)+(n)-1 )
#define INSLWI_C( rA, rS, n, b )  RLWIMI_C( rA, rS, 32-(b), (b), (b)+(n)-1 )
#define INSRWI( rA, rS, n, b )  RLWIMI( rA, rS, 32-((b)+(n)), (b), (b)+(n)-1 )
#define INSRWI_C( rA, rS, n, b )  RLWIMI_C( rA, rS, 32-((b)+(n)), (b), (b)+(n)-1 )
#define ROTLW( rA, rS, rB )  RLWNM( rA, rS, rB, 0, 31 )
#define ROTLW_C( rA, rS, rB )  RLWNM_C( rA, rS, rB, 0, 31 )
#define ROTLWI( rA, rS, n )  RLWINM( rA, rS, (n), 0, 31 )
#define ROTLWI_C( rA, rS, n )  RLWINM_C( rA, rS, (n), 0, 31 )
#define ROTRWI( rA, rS, n )  RLWINM( rA, rS, 32-(n), 0, 31 )
#define ROTRWI_C( rA, rS, n )  RLWINM_C( rA, rS, 32-(n), 0, 31 )
#define SLW( rA, rS, rB )  (rA) = (rS) << (rB);
#define SLW_C( rA, rS, rB )  (rA) = (rS) << (rB); CR[0] = (long)(rA);
#define SLWI( rA, rS, SH )  (rA) = (rS) << (SH);
#define SLWI_C( rA, rS, SH )  (rA) = (rS) << (SH); CR[0] = (long)(rA);
#define SRAW( rA, rS, rB )  (rA) = (long)(rS) >> (rB);
#define SRAW_C( rA, rS, rB )  (rA) = (long)(rS) >> (rB); CR[0] = (long)(rA);
#define SRAWI( rA, rS, SH )  (rA) = (long)(rS) >> (SH);
#define SRAWI_C( rA, rS, SH )  (rA) = (long)(rS) >> (SH); CR[0] = (long)(rA);
#define SRW( rA, rS, rB )  (rA) = (ulong)(rS) >> (rB);
#define SRW_C( rA, rS, rB )  (rA) = (ulong)(rS) >> (rB); CR[0] = (long)(rA);

#define SRWI { rA, rS, SH ) 
#define SRWI_C( rA, rS, SH ) 
long) (rA) ; 

#define STB ( rS, rA, d ) 
#define STBU( rS, rA, d ) 
#define STBUX ( rS, rA, rB ) 
ttdefine STBX( rS, rA, rB ) 
#define STFD ( frD, rA, d ) 
#define STFDU( frD, rA, d ) 
ttdefine STFDUX ( frD, rA, rB ) 
ttdefine STFDX ( frD, rA, rB ) 
ttdefine STFS ( frD, rA, d ) 
ttdefine STFSU( frD, rA, d ) 
ttdefine STFSUX( frD, rA, rB ) 
ttdefine STFSX ( frD, rA, rB ) 
ttdefine STH( rS, rA, d ) 
ttdefine STHU< rS, rA, d ) 
ttdefine STHUX ( rS, rA, rB ) 
ttdefine STHX ( rS, rA, rB ) 
ttdefine STW( rS, rA, d ) 
ttdefine STWU ( rS, rA, d ) 
ttdefine STWUX { rS, rA, rB ) 
ttdefine STWX( rS, rA, rB ) 
ttdefine SUB { rD, rA, rB ) 
ttdefine SUB_C( rD, rA, rB ) 
(long) (rD) ; 

ttdefine SUBFIC ( rD, rA, SIMM ) 
ttdefine SUBI ( rD, rA, SIMM ) 
ttdefine SUBIC_C( rD, rA, SIMM ) 
rD) ; 

ttdefine SUBIS ( rD, rA, SIMM ) 
ttdefine TEST_COUNT( label ) 
ttdefine XOR ( rA, rS, rB ) 
ttdefine XOR_C ( rA, rS, rB ) 
(long) (rA) ,- 

ttdefine XORI ( rA, rS, UIMM ) 
ttdefine XORIS ( rA, rS, UIMM ) 

ttif defined ( BUILD_MAX ) 



(rA) = (ulong) (rS) » (SH) ; 

(rA) = (ulong) (rS) » (SH) ; CR[0] = ( 



* (char *) ( ( 
Mchar *) ( ( 
*(char *) ( ( 
*(char *) ( ( 
* (double *) 
* (double *) 
* (double *) 
* (double *) 
* (float 
* (float 
* (float 
* (float 
* (short 
* (short 
* (short 
* (short 
* (long 
* (long 
*(long *) ( ( 
*(long *) ( ( 
(rD) = (rA) 
(rD) = (rA) 



*) ( 



*) ( (: 
*) ( ( 



rA) + (d) ) = 
rA) += (d)) = 
rA) += (rB) ) 
rA) + (rB)) = 
((rA) + (d)) 
((rA) += (d)) 
((rA) += (rB) 
((rA) + (rB)) 
(rA) + (d)) = 
(rA) += (d) ) 
(rA) += (rB) ) 
(rB) ) 
(d)) = 
= (d)) 
= (rB)) 

(rB) ) 
(d)) 



(rA) 
(rA) 
(rA) 
(rA) 
(rA) 
rA) 

rA) += (d) ) 
rA) += (rB)) 
rA) + (rB) ) 
(rB) ,- 

(rB) ; CR[0 



(rS) ,- 
(rS) ; 

(rS) ; 
(rS) ,- 
= (frD) ; 
= (frD); 

= (frD); 
= (frD); 
(frD) ; 

(frD) ; 
= (frD); 
= (frD) ; 

(rS) ; 
= (rS) ; 

= (rS) ; 
= (rS) ,- 
(rS) ; 

(rS) ; 
= (rS) ; 
(rS) ; 

] = 



(rD) = (SIMM) - (rA) ; 

(rD) = (rA) - (SIMM) ; 

(rD) = (rA) - (SIMM); CR[0] = (long) ( 

(rD) = (rA) - ((SIMM) « 16); 

if ( --CTR ) goto label; 

(rA) = (rS) * (rB) ,- 

(rA) = (rS) * (rB) ; CR[0] = 



/*
 * VMX instructions
 */

#define BR_VMX_ALL_TRUE( label )    if ( CR[6] & 0x8 ) goto label;
#define BR_VMX_ALL_FALSE( label )   if ( CR[6] & 0x2 ) goto label;
#define BR_VMX_NONE_TRUE( label )   if ( CR[6] & 0x2 ) goto label;
#define BR_VMX_SOME_FALSE( label )  if ( !(CR[6] & 0x8) ) goto label;
#define BR_VMX_SOME_TRUE( label )   if ( !(CR[6] & 0x2) ) goto label;

#define DSS( STRM )
#define DSSALL
#define DST( rA, rB, STRM )
#define DSTT( rA, rB, STRM )
#define DSTST( rA, rB, STRM )
#define DSTSTT( rA, rB, STRM )



#if defined( COMPILE_LVX_CHARS )

#define LVX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[C_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
    (vT).c[C_INDEX_MUNGE( i + 1 )] = addr[1]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
    (vT).c[C_INDEX_MUNGE( i + 1 )] = addr[1]; \
    (vT).c[C_INDEX_MUNGE( i + 2 )] = addr[2]; \
    (vT).c[C_INDEX_MUNGE( i + 3 )] = addr[3]; \
}

#elif defined( COMPILE_LVX_SHORTS )

#define LVX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[S_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
    (vT).s[S_INDEX_MUNGE( i + 1 )] = addr[1]; \
}

#else

#define LVX( vT, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[L_INDEX_MUNGE( i )] = addr[i]; \
}

#define LVEBX( vT, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    (vT).c[C_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEHX( vT, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    (vT).s[S_INDEX_MUNGE( i )] = addr[0]; \
}

#define LVEWX( vT, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 2; \
    (vT).l[L_INDEX_MUNGE( i )] = addr[0]; \
}

#endif
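
The LVX family computes an effective address from rA + rB, clears its low bits to reach an aligned block, and (for the element forms) uses the cleared bits as a lane index. A minimal sketch of that address arithmetic follows. The value `0xF` for the mask is an assumption (the listing's `VMX_ADDR_MASK` is defined elsewhere), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ADDR_MASK ((uintptr_t)0xF)   /* assumed value of VMX_ADDR_MASK */

/* Copy the 16-byte aligned block that contains p into out[0..15]. */
static void load_aligned16(const unsigned char *p, unsigned char out[16])
{
    assert(p != NULL);
    const unsigned char *base =
        (const unsigned char *)((uintptr_t)p & ~ADDR_MASK);
    memcpy(out, base, 16);
}

/* Byte offset of p inside its aligned block (the lane index for a byte load). */
static unsigned offset_in_block(const unsigned char *p)
{
    return (unsigned)((uintptr_t)p & ADDR_MASK);
}
```

The sketch ignores the `*_INDEX_MUNGE` lane remapping, which the listing applies on top of this offset.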

#if defined( COMPILE_STVX_CHARS )

#define STVX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        addr[i] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
    addr[1] = (vS).c[C_INDEX_MUNGE( i + 1 )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
    addr[1] = (vS).c[C_INDEX_MUNGE( i + 1 )]; \
    addr[2] = (vS).c[C_INDEX_MUNGE( i + 2 )]; \
    addr[3] = (vS).c[C_INDEX_MUNGE( i + 3 )]; \
}

#elif defined( COMPILE_STVX_SHORTS )

#define STVX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 8; i++ ) \
        addr[i] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
    addr[1] = (vS).s[S_INDEX_MUNGE( i + 1 )]; \
}

#else

#define STVX( vS, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~VMX_ADDR_MASK); \
    for ( i = 0; i < 4; i++ ) \
        addr[i] = (vS).l[L_INDEX_MUNGE( i )]; \
}

#define STVEBX( vS, rA, rB ) \
{ \
    char *addr; \
    ulong i; \
    addr = (char *)((ulong)(rA) + (ulong)(rB)); \
    i = (ulong)addr & VMX_ADDR_MASK; \
    addr[0] = (vS).c[C_INDEX_MUNGE( i )]; \
}

#define STVEHX( vS, rA, rB ) \
{ \
    short *addr; \
    ulong i; \
    addr = (short *)(((ulong)(rA) + (ulong)(rB)) & ~1); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 1; \
    addr[0] = (vS).s[S_INDEX_MUNGE( i )]; \
}

#define STVEWX( vS, rA, rB ) \
{ \
    long *addr; \
    ulong i; \
    addr = (long *)(((ulong)(rA) + (ulong)(rB)) & ~3); \
    i = ((ulong)addr & VMX_ADDR_MASK) >> 2; \
    addr[0] = (vS).l[L_INDEX_MUNGE( i )]; \
}

#endif

#define LVSL_BE( vT, rA, rB ) \
{ \
    ulong i, j; \
    j = ((ulong)(rA) + (ulong)(rB)) & VMX_ADDR_MASK; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = j + i; \
}

#define LVSR_BE( vT, rA, rB ) \
{ \
    ulong i, j; \
    j = 16 - (((ulong)(rA) + (ulong)(rB)) & VMX_ADDR_MASK); \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = j + i; \
}



#if defined( LITTLE_ENDIAN )
#define LVSL( vT, rA, rB ) 
#define LVSR( vT, rA, rB ) 
#else 

#define LVSL( vT, rA, rB ) 
#define LVSR( vT, rA, rB ) 
#endif 

#define LVXL ( vT, rA, rB ) 
#define STVXL ( vS, rA, rB ) 
#define VADDFP( vT, vA, vB ) \
{ \
    ulong i; \
    float a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = a + b; \
        (vT).f[i] = c; \
    } \
}



LVSL_BE( vT, rA, rB );
LVSR_BE( vT, rA, rB );



#define VADDSBS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (long)(vA).c[i] + (long)(vB).c[i]; \
        if ( itemp < -128 ) (vT).c[i] = -128; \
        else if ( itemp > 127 ) (vT).c[i] = 127; \
        else (vT).c[i] = (char)itemp; \
    } \
}
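
The saturating adds widen each lane, add, and clamp the result back into the lane's range instead of letting it wrap. A minimal stand-alone sketch of the signed-byte case follows; the helper names `sat_s8` and `add_sat_s8x16` are hypothetical and plain arrays stand in for the listing's vector register type.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a wide intermediate result into the signed 8-bit range. */
static int8_t sat_s8(int32_t x)
{
    if (x < INT8_MIN) return INT8_MIN;
    if (x > INT8_MAX) return INT8_MAX;
    return (int8_t)x;
}

/* Element-wise saturating add of two 16-lane signed byte vectors. */
static void add_sat_s8x16(int8_t out[16], const int8_t a[16], const int8_t b[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = sat_s8((int32_t)a[i] + (int32_t)b[i]);
}
```

With inputs 100 + 100 the lane saturates to 127, and -100 + -100 saturates to -128, whereas a modulo add would wrap.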

#define VADDSHS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (long)(vA).s[i] + (long)(vB).s[i]; \
        if ( itemp < -32768 ) (vT).s[i] = -32768; \
        else if ( itemp > 32767 ) (vT).s[i] = 32767; \
        else (vT).s[i] = (short)itemp; \
    } \
}

#define VADDSWS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).l[i] + (vB).l[i]; \
        if ( ((vA).l[i] > 0) && ((vB).l[i] > 0) && (itemp < 0) ) \
            (vT).l[i] = (long)0x7fffffff; \
        else if ( ((vA).l[i] < 0) && ((vB).l[i] < 0) && (itemp > 0) ) \
            (vT).l[i] = (long)0x80000000; \
        else (vT).l[i] = itemp; \
    } \
}

#define VADDUBM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = (vA).uc[i] + (vB).uc[i]; \
}

#define VADDUBS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (ulong)(vA).uc[i] + (ulong)(vB).uc[i]; \
        if ( itemp > 255 ) (vT).uc[i] = 255; \
        else (vT).uc[i] = (uchar)itemp; \
    } \
}

#define VADDUHM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = (vA).us[i] + (vB).us[i]; \
}

#define VADDUHS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (ulong)(vA).us[i] + (ulong)(vB).us[i]; \
        if ( itemp > 65535 ) (vT).us[i] = 65535; \
        else (vT).us[i] = (ushort)itemp; \
    } \
}

#define VADDUWM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] + (vB).ul[i]; \
}

#define VADDUWS( vT, vA, vB ) \
{ \
    ulong i, itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).ul[i] + (vB).ul[i]; \
        if ( itemp < (vA).ul[i] ) (vT).ul[i] = (ulong)0xffffffff; \
        else (vT).ul[i] = itemp; \
    } \
}
#define VAND( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] & (vB).ul[i]; \
}

#define VANDC( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] & ~(vB).ul[i]; \
}

#define VCMPEQFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] == (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPEQFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] == (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).uc[i] == (vB).uc[i] ) ? 0xff : 0; \
}

#define VCMPEQUB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).uc[i] == (vB).uc[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).us[i] == (vB).us[i] ) ? 0xffff : 0; \
}

#define VCMPEQUH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).us[i] == (vB).us[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPEQUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).ul[i] == (vB).ul[i] ) ? 0xffffffff : 0; \
}

#define VCMPEQUW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).ul[i] == (vB).ul[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGEFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] >= (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPGEFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] >= (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).f[i] > (vB).f[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTFP_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).f[i] > (vB).f[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}



#define VCMPGTSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).c[i] > (vB).c[i] ) ? 0xff : 0; \
}

#define VCMPGTSB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).c[i] > (vB).c[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).s[i] > (vB).s[i] ) ? 0xffff : 0; \
}

#define VCMPGTSH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).s[i] > (vB).s[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).l[i] > (vB).l[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTSW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).l[i] > (vB).l[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ( (vA).uc[i] > (vB).uc[i] ) ? 0xff : 0; \
}

#define VCMPGTUB_C( vT, vA, vB ) \
{ \
    ulong i; \
    uchar t, f; \
    t = 0xff; \
    f = 0; \
    for ( i = 0; i < 16; i++ ) { \
        (vT).uc[i] = ( (vA).uc[i] > (vB).uc[i] ) ? 0xff : 0; \
        t &= (vT).uc[i]; \
        f |= (vT).uc[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}
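
The `_C` compare variants produce a per-lane mask and also summarize it into CR[6]: 0x8 when every lane compared true, 0x2 when none did, and 0 otherwise. The sketch below restates that pattern as one function for the unsigned-byte greater-than case. The function and enum names are hypothetical, and the summary is returned rather than written to a global CR array.

```c
#include <assert.h>
#include <stdint.h>

enum { CR6_ALL_TRUE = 0x8, CR6_NONE_TRUE = 0x2 };

/* Compare 16 unsigned bytes for a > b, write 0xff/0x00 lane masks into out,
   and return the CR6-style summary of the comparison. */
static unsigned cmpgt_u8x16(uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
    uint8_t all = 0xff;   /* stays 0xff only if every lane is true */
    uint8_t any = 0;      /* becomes nonzero if any lane is true   */
    for (int i = 0; i < 16; i++) {
        out[i] = (a[i] > b[i]) ? 0xff : 0;
        all &= out[i];
        any |= out[i];
    }
    if (all)  return CR6_ALL_TRUE;
    if (!any) return CR6_NONE_TRUE;
    return 0;
}
```

A branch on "all lanes true" then reduces to testing the returned value against 0x8, which is what the BR_VMX_* macros do with CR[6].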

#define VCMPGTUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ( (vA).us[i] > (vB).us[i] ) ? 0xffff : 0; \
}

#define VCMPGTUH_C( vT, vA, vB ) \
{ \
    ulong i; \
    ushort t, f; \
    t = 0xffff; \
    f = 0; \
    for ( i = 0; i < 8; i++ ) { \
        (vT).us[i] = ( (vA).us[i] > (vB).us[i] ) ? 0xffff : 0; \
        t &= (vT).us[i]; \
        f |= (vT).us[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCMPGTUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ( (vA).ul[i] > (vB).ul[i] ) ? 0xffffffff : 0; \
}

#define VCMPGTUW_C( vT, vA, vB ) \
{ \
    ulong i; \
    ulong t, f; \
    t = 0xffffffff; \
    f = 0; \
    for ( i = 0; i < 4; i++ ) { \
        (vT).ul[i] = ( (vA).ul[i] > (vB).ul[i] ) ? 0xffffffff : 0; \
        t &= (vT).ul[i]; \
        f |= (vT).ul[i]; \
    } \
    if ( t ) CR[6] = 0x8; \
    else if ( !f ) CR[6] = 0x2; \
    else CR[6] = 0; \
}

#define VCFSX( vT, vB, UIMM ) \
{ \
    float fj; \
    ulong i, j; \
    j = (127 - ((UIMM) & 0x1f)) << 23; \
    fj = *(float *)&j; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = (float)((vB).l[i]) / fj; \
}

#define VCFUX( vT, vB, UIMM ) \
{ \
    float fj; \
    ulong i, j; \
    j = (127 - ((UIMM) & 0x1f)) << 23; \
    fj = *(float *)&j; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = (float)((vB).ul[i]) / fj; \
}

#define VCTSXS( vT, vB, UIMM ) \
{ \
    float f, g, max, scale; \
    ulong i; \
    long l; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    i = (127 + ((UIMM) & 0x1f)) << 23; \
    scale = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        g = f * scale; \
        if ( g <= -max ) l = 0x80000000; \
        else if ( g >= max ) l = 0x7fffffff; \
        else l = (long)f << ((UIMM) & 0x1f); \
        (vT).l[i] = l; \
    } \
}

#define VCTUXS( vT, vB, UIMM ) \
{ \
    float f, g, max, scale; \
    ulong i, ul; \
    i = (127 + 32) << 23; \
    max = *(float *)&i; \
    i = (127 + ((UIMM) & 0x1f)) << 23; \
    scale = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        g = f * scale; \
        if ( g <= 0 ) ul = 0; \
        else if ( g >= max ) ul = 0xffffffff; \
        else ul = (ulong)f << ((UIMM) & 0x1f); \
        (vT).ul[i] = ul; \
    } \
}
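
The fixed-point conversion macros obtain a power-of-two scale factor by writing a biased exponent into the bit pattern of a float through a pointer cast. The sketch below shows the same idea with `memcpy`, which avoids the aliasing problems of the cast. It assumes `float` is an IEEE-754 single-precision type; the names `pow2f_bits` and `fixed_to_float` are hypothetical, and the conversion follows the usual convention of dividing by 2^frac_bits rather than reproducing any one macro exactly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the float 2^e directly from its IEEE-754 single-precision bits.
   Valid for normal numbers only: -126 <= e <= 127. */
static float pow2f_bits(int e)
{
    assert(e >= -126 && e <= 127);
    uint32_t bits = (uint32_t)(127 + e) << 23;   /* biased exponent, zero mantissa */
    float f;
    memcpy(&f, &bits, sizeof f);                 /* well-defined type pun */
    return f;
}

/* Fixed-point to float: treat x as having frac_bits fractional bits (0..31). */
static float fixed_to_float(int32_t x, unsigned frac_bits)
{
    return (float)x * pow2f_bits(-(int)(frac_bits & 0x1fu));
}
```

Because the scale is an exact power of two, multiplying by it changes only the exponent of the result and introduces no additional rounding beyond the integer-to-float conversion.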

#define VEXPTEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = exp( 0.693147180559945 * (vB).f[i] ); \
}

#define VLOGEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.442695040888963 * log( (vB).f[i] ); \
}

#define VMADDFP( vT, vA, vC, vB ) \
{ \
    ulong i; \
    float a, b, c, d; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = (vC).f[i]; \
        d = a * c; \
        d = b + d; \
        (vT).f[i] = d; \
    } \
}

#define VMAXFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = ((vA).f[i] >= (vB).f[i]) ? (vA).f[i] : (vB).f[i]; \
}

#define VMAXSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = ((vA).c[i] >= (vB).c[i]) ? (vA).c[i] : (vB).c[i]; \
}

#define VMAXSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = ((vA).s[i] >= (vB).s[i]) ? (vA).s[i] : (vB).s[i]; \
}

#define VMAXSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = ((vA).l[i] >= (vB).l[i]) ? (vA).l[i] : (vB).l[i]; \
}

#define VMAXUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ((vA).uc[i] >= (vB).uc[i]) ? (vA).uc[i] : (vB).uc[i]; \
}

#define VMAXUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ((vA).us[i] >= (vB).us[i]) ? (vA).us[i] : (vB).us[i]; \
}

#define VMAXUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ((vA).ul[i] >= (vB).ul[i]) ? (vA).ul[i] : (vB).ul[i]; \
}

#define VMHADDSHS( vD, vA, vB, vC ) \
{ \
    ulong i; \
    long a; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).s[i] * (long)(vB).s[i]; \
        a >>= 15; \
        a += (long)(vC).s[i]; \
        if ( a > 32767 ) a = 32767; \
        else if ( a < -32768 ) a = -32768; \
        (vD).s[i] = (short)a; \
    } \
}

#define VMHRADDSHS( vD, vA, vB, vC ) \
{ \
    ulong i; \
    long a; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).s[i] * (long)(vB).s[i]; \
        a += 0x00004000; \
        a >>= 15; \
        a += (long)(vC).s[i]; \
        if ( a > 32767 ) a = 32767; \
        else if ( a < -32768 ) a = -32768; \
        (vD).s[i] = (short)a; \
    } \
}
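
The two multiply-high-add macros implement a Q15 fixed-point multiply-accumulate: the 32-bit product is shifted right by 15 (optionally after adding the rounding constant 0x4000), the addend is added, and the sum is clamped to the 16-bit range. A minimal scalar sketch follows; `sat_s16` and `q15_madd` are hypothetical names, and the code assumes that `>>` on a negative signed value is an arithmetic shift, which the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit value into the signed 16-bit range. */
static int16_t sat_s16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Q15 multiply-add: ((a*b [+ 0x4000 if round]) >> 15) + c, saturated. */
static int16_t q15_madd(int16_t a, int16_t b, int16_t c, int round)
{
    int32_t p = (int32_t)a * (int32_t)b;   /* at most 2^30 in magnitude */
    if (round)
        p += 0x4000;                       /* add half an LSB of the result */
    p >>= 15;                              /* assumption: arithmetic shift */
    return sat_s16(p + (int32_t)c);
}
```

The only case in which the shifted product alone overflows is (-32768) * (-32768), which yields 32768 and is clamped to 32767.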

#define VMINFP( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = ((vA).f[i] <= (vB).f[i]) ? (vA).f[i] : (vB).f[i]; \
}

#define VMINSB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = ((vA).c[i] <= (vB).c[i]) ? (vA).c[i] : (vB).c[i]; \
}

#define VMINSH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = ((vA).s[i] <= (vB).s[i]) ? (vA).s[i] : (vB).s[i]; \
}

#define VMINSW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = ((vA).l[i] <= (vB).l[i]) ? (vA).l[i] : (vB).l[i]; \
}

#define VMINUB( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = ((vA).uc[i] <= (vB).uc[i]) ? (vA).uc[i] : (vB).uc[i]; \
}

#define VMINUH( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = ((vA).us[i] <= (vB).us[i]) ? (vA).us[i] : (vB).us[i]; \
}

#define VMINUW( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ((vA).ul[i] <= (vB).ul[i]) ? (vA).ul[i] : (vB).ul[i]; \
}

#define VMLADDUHM( vD, vA, vB, vC ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).us[i]; \
        b = (ulong)(vB).us[i]; \
        c = (ulong)(vC).us[i]; \
        c += (a * b); \
        (vD).us[i] = (ushort)c; \
    } \
}

#define VMR( vD, vS ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vD).ul[i] = (vS).ul[i]; \
}






#define VMRGHB_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 8; i++ ) { \
        j = i + i; \
        v.uc[j] = (vA).uc[i]; \
        v.uc[(j+1)] = (vB).uc[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGHH_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 4; i++ ) { \
        j = i + i; \
        v.us[j] = (vA).us[i]; \
        v.us[(j+1)] = (vB).us[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGHW_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 2; i++ ) { \
        j = i + i; \
        v.ul[j] = (vA).ul[i]; \
        v.ul[(j+1)] = (vB).ul[i]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLB_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 8; i++ ) { \
        j = i + i; \
        v.uc[j] = (vA).uc[(8+i)]; \
        v.uc[(j+1)] = (vB).uc[(8+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLH_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 4; i++ ) { \
        j = i + i; \
        v.us[j] = (vA).us[(4+i)]; \
        v.us[(j+1)] = (vB).us[(4+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VMRGLW_BE( vT, vA, vB ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    for ( i = 0; i < 2; i++ ) { \
        j = i + i; \
        v.ul[j] = (vA).ul[(2+i)]; \
        v.ul[(j+1)] = (vB).ul[(2+i)]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#if defined( LITTLE_ENDIAN )
#define VMRGHB( vT, vA, vB )    VMRGLB_BE( vT, vB, vA );
#define VMRGHH( vT, vA, vB )    VMRGLH_BE( vT, vB, vA );
#define VMRGHW( vT, vA, vB )    VMRGLW_BE( vT, vB, vA );
#define VMRGLB( vT, vA, vB )    VMRGHB_BE( vT, vB, vA );
#define VMRGLH( vT, vA, vB )    VMRGHH_BE( vT, vB, vA );
#define VMRGLW( vT, vA, vB )    VMRGHW_BE( vT, vB, vA );
#else
#define VMRGHB( vT, vA, vB )    VMRGHB_BE( vT, vA, vB );
#define VMRGHH( vT, vA, vB )    VMRGHH_BE( vT, vA, vB );
#define VMRGHW( vT, vA, vB )    VMRGHW_BE( vT, vA, vB );
#define VMRGLB( vT, vA, vB )    VMRGLB_BE( vT, vA, vB );
#define VMRGLH( vT, vA, vB )    VMRGLH_BE( vT, vA, vB );
#define VMRGLW( vT, vA, vB )    VMRGLW_BE( vT, vA, vB );
#endif

#define VMSUMMBM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, c; \
    ulong b; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).l[i]; \
        for ( j = 0; j < 4; j++ ) { \
            a = (long)(vA).c[4*i+j]; \
            b = (ulong)(vB).uc[4*i+j]; \
            c += (a * b); \
        } \
        (vT).l[i] = c; \
    } \
}

#define VMSUMSHM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (long)(vA).s[2*i+j]; /* halfwords 2i and 2i+1 feed word i */ \
            b = (long)(vB).s[2*i+j]; \
            c += (a * b); \
        } \
        (vT).l[i] = c; \
    } \
}



#define VMSUMSHS( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    long a, b; \
    double c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (double)(vC).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (long)(vA).s[2*i+j]; /* halfwords 2i and 2i+1 feed word i */ \
            b = (long)(vB).s[2*i+j]; \
            c += (double)(a * b); \
        } \
        if ( c >= 2147483647.0 ) c = 2147483647.0; \
        else if ( c <= -2147483648.0 ) c = -2147483648.0; \
        (vT).l[i] = (long)c; \
    } \
}
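
VMSUMSHS accumulates in a `double` so that an overflowing sum can be detected before it is clamped to 32 bits. The sketch below shows an alternative that swaps in a 64-bit integer accumulator, which is exact and avoids floating point; it is not the listing's method. The function name is hypothetical, and it assumes that halfword lanes 2i and 2i+1 feed word i.

```c
#include <assert.h>
#include <stdint.h>

/* For each of 4 words: acc[i] + a[2i]*b[2i] + a[2i+1]*b[2i+1], clamped to int32. */
static void msum_s16_sat(int32_t out[4], const int16_t a[8],
                         const int16_t b[8], const int32_t acc[4])
{
    for (int i = 0; i < 4; i++) {
        int64_t sum = acc[i];                      /* wide accumulator cannot overflow here */
        for (int j = 0; j < 2; j++)
            sum += (int64_t)a[2 * i + j] * (int64_t)b[2 * i + j];
        if (sum > INT32_MAX)      sum = INT32_MAX; /* saturate high */
        else if (sum < INT32_MIN) sum = INT32_MIN; /* saturate low  */
        out[i] = (int32_t)sum;
    }
}
```

Each product is at most 2^30 in magnitude, so the three-term sum always fits comfortably in 64 bits.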

#define VMSUMUBM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).ul[i]; \
        for ( j = 0; j < 4; j++ ) { \
            a = (ulong)(vA).uc[4*i+j]; \
            b = (ulong)(vB).uc[4*i+j]; \
            c += (a * b); \
        } \
        (vT).ul[i] = c; \
    } \
}

#define VMSUMUHM( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (vC).ul[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (ulong)(vA).us[2*i+j]; /* halfwords 2i and 2i+1 feed word i */ \
            b = (ulong)(vB).us[2*i+j]; \
            c += (a * b); \
        } \
        (vT).ul[i] = c; \
    } \
}

#define VMSUMUHS( vT, vA, vB, vC ) \
{ \
    ulong i, j; \
    ulong a, b; \
    double c; \
    for ( i = 0; i < 4; i++ ) { \
        c = (double)(vC).ul[i]; \
        for ( j = 0; j < 2; j++ ) { \
            a = (ulong)(vA).us[2*i+j]; /* halfwords 2i and 2i+1 feed word i */ \
            b = (ulong)(vB).us[2*i+j]; \
            c += (double)(a * b); \
        } \
        if ( c >= 4294967295.0 ) c = 4294967295.0; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VMULESB( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).c[2*i]; \
        b = (long)(vB).c[2*i]; \
        c = a * b; \
        (vT).s[i] = (short)c; \
    } \
}

#define VMULESH( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (long)(vA).s[2*i]; \
        b = (long)(vB).s[2*i]; \
        c = a * b; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMULEUB( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).uc[2*i]; \
        b = (ulong)(vB).uc[2*i]; \
        c = a * b; \
        (vT).us[i] = (ushort)c; \
    } \
}

#define VMULEUH( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (ulong)(vA).us[2*i]; \
        b = (ulong)(vB).us[2*i]; \
        c = a * b; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VMULOSB( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (long)(vA).c[2*i+1]; \
        b = (long)(vB).c[2*i+1]; \
        c = a * b; \
        (vT).s[i] = (short)c; \
    } \
}

#define VMULOSH( vT, vA, vB ) \
{ \
    ulong i; \
    long a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (long)(vA).s[2*i+1]; \
        b = (long)(vB).s[2*i+1]; \
        c = a * b; \
        (vT).l[i] = (long)c; \
    } \
}

#define VMULOUB( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 8; i++ ) { \
        a = (ulong)(vA).uc[2*i+1]; \
        b = (ulong)(vB).uc[2*i+1]; \
        c = a * b; \
        (vT).us[i] = (ushort)c; \
    } \
}

#define VMULOUH( vT, vA, vB ) \
{ \
    ulong i; \
    ulong a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (ulong)(vA).us[2*i+1]; \
        b = (ulong)(vB).us[2*i+1]; \
        c = a * b; \
        (vT).ul[i] = (ulong)c; \
    } \
}

#define VNMSUBFP( vT, vA, vC, vB ) \
{ \
    ulong i; \
    float a, b, c, d; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = (vC).f[i]; \
        d = a * c; \
        d = b - d; \
        (vT).f[i] = d; \
    } \
}

#define VNOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = ~((vA).ul[i] | (vB).ul[i]); \
}

#define VNOT( vT, vA )    VNOR( vT, vA, vA )

#define VOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] | (vB).ul[i]; \
}

#define VPERM_BE( vT, vA, vB, vC ) \
{ \
    VMX_reg v; \
    ulong field, i; \
    for ( i = 0; i < 16; i++ ) { \
        field = (vC).uc[i]; \
        v.uc[i] = ( field < 16 ) ? (vA).uc[field] : (vB).uc[field - 16]; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}
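
VPERM_BE builds each result byte by using a control byte to index into the 32-byte concatenation of vA and vB, writing through a temporary so that the target may be the same register as a source. The following is a minimal sketch of that selection on plain arrays. The name `perm_u8x16` is hypothetical, and masking the selector to its low five bits is an added assumption (it matches the AltiVec `vperm` definition but is not in the macro).

```c
#include <assert.h>
#include <stdint.h>

/* out[i] = byte chosen by ctl[i] from the 32-byte concatenation a:b. */
static void perm_u8x16(uint8_t out[16], const uint8_t a[16],
                       const uint8_t b[16], const uint8_t ctl[16])
{
    uint8_t tmp[16];                     /* temporary, so out may alias a, b or ctl */
    for (int i = 0; i < 16; i++) {
        unsigned sel = ctl[i] & 0x1fu;   /* assumption: only the low 5 bits select */
        tmp[i] = (sel < 16) ? a[sel] : b[sel - 16];
    }
    for (int i = 0; i < 16; i++)
        out[i] = tmp[i];
}
```

With a control vector of 15, 14, ..., 0 the call reverses the bytes of `a`, and the same call works in place because of the temporary.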

#define VPKUHUM_BE ( vT, vA, vB, base ) \ 
{ \ 

VMX reg v; \ 
ulong i, j; \ 
j = base; \ 

for ( i = 0; i < 8; i++ ) { \ 
v.ucfi] = (vA) .uc[(j)] ; \ 
v.uc[i+8] = (vB) .uc[(j)] ; \ 
j += 2; \ 

} \ 

for ( i = 0; i < 4; i++ ) \ 
(vT) .ul[i] = v.ul[i] ; \ 

#define VPKUHUS BE ( vT, vA, vB, base ) \ 
{ \ 

•VMX reg v; \ 
ulong i, j; \ 
j = base; \ 

for ( i = 0; i < 8; i++ ) { \ 

v.uc[i] = (vA) .uc [ (j"l) ] ? (uchar)255 : (vA) . uc [ ( j ) ] ; \ 
v.uc[i+8] = (vB) .uc[(j A l)] ? (uchar)255 : (vB) .uc [ ( j ) ] ; \ 

for ( i = 0; i < 4; i++ ) \ 
(vT) .ul [i] = v.ul [i] ; \ 

#define VPKSHUS BE ( vT, vA, vB, base ) \ 
{ \ 

VMX reg v; \ 
ulong i, j; \ 
j = base; \ 
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for ( i = 0; i < 8; i++ ) { \ 

if ( (vA).s[i] <= 0 ) v.uc[i] = 0 ,- \ 

else if ( <vA).s[i] >= 255 ) v.uc[i] = 255; \ 

else v.uc[i] = (vA).ucEj]; \ 

if ( (vB).s[i] <= 0 ) v.uc[i+8] = 0; \ 

else if ( (vB).s[i] >= 255 ) v.uc[i+8] = 255; \ 

else v.uc[i+8] = (vB).uc[j]; \ 

) x j + " 2! X 

for ( i = 0; i < 4; i++ ) \ 
(vT) .ul [i] = v.ul [i] ; \ 

#define VPKSHSS BE ( vT, vA, vB, base ) \ 
{ \ 

VMX reg v; \ 
ulong i, j ; \ 
j = base; \ 

for ( i = 0; i < 8; i++ ) { \ 

if ( (vA).s[i] <= -128 ) v.c[i] = -128; \ 
else if { <vA).s[i] >= 127 ) v.c[i] = 127; \ 
elsev.cfi] = <vA).c[j]; \ 

if ( (vB).s[i] <= -128 ) v.c[i+8] = -128; \ 
else if { (vB).s[i] >= 127 ) v.c[i+8] = 127; \ 
else v.c[i+8] = (vB).c[j] ; \ 

for ( i = 0; i < 4; i++ ) \ 
(vT) .ul [i] = v.ul [i] ; \ 

#define VPKUWUM_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        v.us[i] = (vA).us[(j)]; \
        v.us[i+4] = (vB).us[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKUWUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        /* saturate when the other halfword of the word is nonzero */ \
        v.us[i] = (vA).us[(j^1)] ? (ushort)65535 : (vA).us[(j)]; \
        v.us[i+4] = (vB).us[(j^1)] ? (ushort)65535 : (vB).us[(j)]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VPKSWUS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).l[i] <= 0 ) v.us[i] = 0; \
        else if ( (vA).l[i] >= 65535 ) v.us[i] = 65535; \
        else v.us[i] = (vA).us[j]; \
        if ( (vB).l[i] <= 0 ) v.us[i+4] = 0; \
        else if ( (vB).l[i] >= 65535 ) v.us[i+4] = 65535; \
        else v.us[i+4] = (vB).us[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#define VPKSWSS_BE( vT, vA, vB, base ) \
{ \
    VMX_reg v; \
    ulong i, j; \
    j = base; \
    /* each source holds 4 words, so the loop runs 4 times and the */ \
    /* second source lands at offset 4 (the listing shows 8 here)  */ \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).l[i] <= -32768 ) v.s[i] = -32768; \
        else if ( (vA).l[i] >= 32767 ) v.s[i] = 32767; \
        else v.s[i] = (vA).s[j]; \
        if ( (vB).l[i] <= -32768 ) v.s[i+4] = -32768; \
        else if ( (vB).l[i] >= 32767 ) v.s[i+4] = 32767; \
        else v.s[i+4] = (vB).s[j]; \
        j += 2; \
    } \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}



#if defined( LITTLE_ENDIAN )
#define VPERM( vT, vA, vB, vC )   VPERM_BE( vT, vB, vA, vC );
#define VPKUHUM( vT, vA, vB )     VPKUHUM_BE( vT, vB, vA, 0 )
#define VPKUHUS( vT, vA, vB )     VPKUHUS_BE( vT, vB, vA, 0 )
#define VPKSHUS( vT, vA, vB )     VPKSHUS_BE( vT, vB, vA, 0 )
#define VPKSHSS( vT, vA, vB )     VPKSHSS_BE( vT, vB, vA, 0 )
#define VPKUWUM( vT, vA, vB )     VPKUWUM_BE( vT, vB, vA, 0 )
#define VPKUWUS( vT, vA, vB )     VPKUWUS_BE( vT, vB, vA, 0 )
#define VPKSWUS( vT, vA, vB )     VPKSWUS_BE( vT, vB, vA, 0 )
#define VPKSWSS( vT, vA, vB )     VPKSWSS_BE( vT, vB, vA, 0 )
#else
#define VPERM( vT, vA, vB, vC )   VPERM_BE( vT, vA, vB, vC );
#define VPKUHUM( vT, vA, vB )     VPKUHUM_BE( vT, vA, vB, 1 );
#define VPKUHUS( vT, vA, vB )     VPKUHUS_BE( vT, vA, vB, 1 );
#define VPKSHUS( vT, vA, vB )     VPKSHUS_BE( vT, vA, vB, 1 );
#define VPKSHSS( vT, vA, vB )     VPKSHSS_BE( vT, vA, vB, 1 );
#define VPKUWUM( vT, vA, vB )     VPKUWUM_BE( vT, vA, vB, 1 );
#define VPKUWUS( vT, vA, vB )     VPKUWUS_BE( vT, vA, vB, 1 );
#define VPKSWUS( vT, vA, vB )     VPKSWUS_BE( vT, vA, vB, 1 );
#define VPKSWSS( vT, vA, vB )     VPKSWSS_BE( vT, vA, vB, 1 );
#endif



#define VREFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.0 / (vB).f[i]; \
}

#define VRFIM( vT, vB ) \
{ \
    float f, max, r; \
    ulong i; \
    i = (127 + 31) << 23;   /* bit pattern of 2^31 as a float */ \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) { \
            r = (float)((long)f); \
            if ( r > f ) --r;   /* round toward minus infinity */ \
            f = r; \
        } \
        (vT).f[i] = f; \
    } \
}



#define VRFIN( vT, vB ) \
{ \
    float f, r, s; \
    ulong i; \
    long lr; \
    for ( i = 0; i < 4; i++ ) { \
        s = f = (vB).f[i]; \
        if ( f < 0.0 ) f = -f; \
        r = f + 0.5; \
        if ( r != f ) { \
            lr = (long)r; \
            f = (float)lr; \
            if ( f == r ) f = (float)(lr & ~1);   /* tie: round to even */ \
        } \
        if ( s < 0.0 ) f = -f; \
        (vT).f[i] = f; \
    } \
}
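VRFIN rounds each element to the nearest integer and resolves exact ties toward the even neighbor. The sketch below restates that per-element logic as a standalone function. The name `round_half_even` is hypothetical, and the sketch assumes the magnitude of the input is small enough (below 2^23) that adding 0.5 is exact in single precision.

```c
/* Hypothetical standalone version of the per-element rounding logic.
 * Assumes |x| < 2^23 so that x + 0.5 is represented exactly. */
static float round_half_even(float x)
{
    float f = (x < 0.0f) ? -x : x;   /* work on the magnitude */
    float r = f + 0.5f;

    if (r != f) {
        long lr = (long)r;           /* truncate toward zero */
        f = (float)lr;
        if (f == r)                  /* exact tie: force an even result */
            f = (float)(lr & ~1L);
    }
    return (x < 0.0f) ? -f : f;      /* restore the sign */
}
```

For example, `round_half_even(2.5f)` and `round_half_even(1.7f)` both yield 2.0, while `round_half_even(3.5f)` yields 4.0.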

#define VRFIP( vT, vB ) \
{ \
    float f, max, r; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) { \
            r = (float)((long)f); \
            if ( r < f ) ++r;   /* round toward plus infinity */ \
            f = r; \
        } \
        (vT).f[i] = f; \
    } \
}

#define VRFIZ( vT, vB ) \
{ \
    float f, max; \
    ulong i; \
    i = (127 + 31) << 23; \
    max = *(float *)&i; \
    for ( i = 0; i < 4; i++ ) { \
        f = (vB).f[i]; \
        if ( (f >= -max) && (f < max) ) \
            f = (float)((long)f);   /* round toward zero */ \
        (vT).f[i] = f; \
    } \
}

#define VRLB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = ((vA).uc[i] << sh) | ((vA).uc[i] >> (8-sh)); \
    } \
}

#define VRLH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = ((vA).us[i] << sh) | ((vA).us[i] >> (16-sh)); \
    } \
}

#define VRSQRTEFP( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).f[i] = 1.0 / sqrt( (vB).f[i] ); \
}

#define VRLW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = ((vA).ul[i] << sh) | ((vA).ul[i] >> (32-sh)); \
    } \
}
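The word rotate combines a left shift by `sh` with a right shift by `32 - sh`. When `sh` is 0 the right shift count equals the width of a 32-bit type, which C leaves undefined. A common way around this, shown here as a sketch with the hypothetical helper name `rotl32`, is to mask the complementary count as well.

```c
#include <stdint.h>

/* Hypothetical helper: rotate a 32-bit word left by sh (taken modulo 32).
 * Masking the right-shift count keeps it in 0..31, so sh == 0 is safe. */
static uint32_t rotl32(uint32_t x, unsigned sh)
{
    sh &= 31u;
    return (x << sh) | (x >> ((32u - sh) & 31u));
}
```

With `sh == 0` both shifts are by zero and the result is `x` itself, which is the expected rotate result.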

#define VSEL( vT, vA, vB, vC ) \
{ \
    ulong atemp, btemp, i; \
    for ( i = 0; i < 4; i++ ) { \
        atemp = (vA).ul[i] & ~(vC).ul[i]; \
        btemp = (vB).ul[i] & (vC).ul[i];   /* the listing shows vA here */ \
        (vT).ul[i] = atemp | btemp; \
    } \
}

#define VSL( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = (vB).ul[3] & 0x7; \
    (vT).ul[0] = ((vA).ul[0] << sh) | ((vA).ul[1] >> (32-sh)); \
    (vT).ul[1] = ((vA).ul[1] << sh) | ((vA).ul[2] >> (32-sh)); \
    (vT).ul[2] = ((vA).ul[2] << sh) | ((vA).ul[3] >> (32-sh)); \
    (vT).ul[3] = (vA).ul[3] << sh; \
}

#define VSLDOI( vT, vA, vB, UIMM ) \
{ \
    VMX_reg v; \
    ulong i, j, sh; \
    sh = (UIMM) & 0xf; \
    for ( i = 0; i < (16-sh); i++ ) \
        v.uc[i] = (vA).uc[i+sh]; \
    for ( j = i; j < 16; j++ ) \
        v.uc[j] = (vB).uc[j-i]; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = v.ul[i]; \
}

#define VSLB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = (vA).uc[i] << sh; \
    } \
}

#define VSLH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = (vA).us[i] << sh; \
    } \
}

#define VSLO( vT, vA, vB ) \
{ \
    ulong i, j, sh; \
    sh = ((vB).ul[3] >> 3) & 0xf; \
    for ( i = 0; i < (16-sh); i++ ) \
        (vT).uc[i] = (vA).uc[i+sh]; \
    for ( j = i; j < 16; j++ ) \
        (vT).uc[j] = 0; \
}

#define VSLW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = (vA).ul[i] << sh; \
    } \
}

#define VSR( vT, vA, vB ) \
{ \
    ulong i, sh; \
    sh = (vB).ul[3] & 0x7; \
    (vT).ul[3] = ((vA).ul[3] >> sh) | ((vA).ul[2] << (32-sh)); \
    (vT).ul[2] = ((vA).ul[2] >> sh) | ((vA).ul[1] << (32-sh)); \
    (vT).ul[1] = ((vA).ul[1] >> sh) | ((vA).ul[0] << (32-sh)); \
    (vT).ul[0] = (vA).ul[0] >> sh; \
}

#define VSRAB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).c[i] = (vA).c[i] >> sh; \
    } \
}

#define VSRAH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).s[i] = (vA).s[i] >> sh; \
    } \
}

#define VSRAW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).l[i] = (vA).l[i] >> sh; \
    } \
}

#define VSRB( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 16; i++ ) { \
        sh = (vB).uc[i] & 0x7; \
        (vT).uc[i] = (vA).uc[i] >> sh; \
    } \
}

#define VSRH( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 8; i++ ) { \
        sh = (vB).us[i] & 0xf; \
        (vT).us[i] = (vA).us[i] >> sh; \
    } \
}

#define VSRO( vT, vA, vB ) \
{ \
    long i, j, sh; \
    sh = ((vB).ul[3] >> 3) & 0xf; \
    for ( i = 15; i >= sh; i-- ) \
        (vT).uc[i] = (vA).uc[i-sh]; \
    for ( j = i; j >= 0; j-- ) \
        (vT).uc[j] = 0; \
}

#define VSRW( vT, vA, vB ) \
{ \
    ulong i, sh; \
    for ( i = 0; i < 4; i++ ) { \
        sh = (vB).ul[i] & 0x1f; \
        (vT).ul[i] = (vA).ul[i] >> sh; \
    } \
}

#define VSPLTB( vT, vB, UIMM ) \
{ \
    uchar c; \
    ulong i; \
    c = (vB).uc[C_INDEX_MUNGE( UIMM ) & 0xf]; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = c; \
}

#define VSPLTH( vT, vB, UIMM ) \
{ \
    ushort s; \
    ulong i; \
    s = (vB).us[S_INDEX_MUNGE( UIMM ) & 0x7]; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = s; \
}

#define VSPLTW( vT, vB, UIMM ) \
{ \
    ulong i, l; \
    l = (vB).ul[L_INDEX_MUNGE( UIMM ) & 0x3]; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = l; \
}

#define VSPLTISB( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).c[i] = (char)(SIMM); \
}

#define VSPLTISH( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = (short)(SIMM); \
}

#define VSPLTISW( vT, SIMM ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = (long)(SIMM); \
}

#define VSUBFP( vT, vA, vB ) \
{ \
    ulong i; \
    float a, b, c; \
    for ( i = 0; i < 4; i++ ) { \
        a = (vA).f[i]; \
        b = (vB).f[i]; \
        c = a - b; \
        (vT).f[i] = c; \
    } \
}

#define VSUBSBS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 16; i++ ) { \
        itemp = (long)(vA).c[i] - (long)(vB).c[i]; \
        if ( itemp < -128 ) (vT).c[i] = -128; \
        else if ( itemp > 127 ) (vT).c[i] = 127; \
        else (vT).c[i] = (char)itemp; \
    } \
}

#define VSUBSHS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 8; i++ ) { \
        itemp = (long)(vA).s[i] - (long)(vB).s[i]; \
        if ( itemp < -32768 ) (vT).s[i] = -32768; \
        else if ( itemp > 32767 ) (vT).s[i] = 32767; \
        else (vT).s[i] = (short)itemp; \
    } \
}

#define VSUBSWS( vT, vA, vB ) \
{ \
    ulong i; \
    long itemp; \
    for ( i = 0; i < 4; i++ ) { \
        itemp = (vA).l[i] - (vB).l[i]; \
        /* a sign flip in the wrapped difference signals overflow */ \
        if ( ((vA).l[i] >= 0) && ((vB).l[i] < 0) && (itemp < 0) ) \
            (vT).l[i] = (long)0x7fffffff; \
        else if ( ((vA).l[i] < 0) && ((vB).l[i] > 0) && (itemp > 0) ) \
            (vT).l[i] = (long)0x80000000; \
        else (vT).l[i] = itemp; \
    } \
}

#define VSUBUBM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) \
        (vT).uc[i] = (vA).uc[i] - (vB).uc[i]; \
}

#define VSUBUBS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 16; i++ ) { \
        if ( (vA).uc[i] <= (vB).uc[i] ) (vT).uc[i] = 0; \
        else (vT).uc[i] = (vA).uc[i] - (vB).uc[i]; \
    } \
}

#define VSUBUHM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).us[i] = (vA).us[i] - (vB).us[i]; \
}

#define VSUBUHS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) { \
        if ( (vA).us[i] <= (vB).us[i] ) (vT).us[i] = 0; \
        else (vT).us[i] = (vA).us[i] - (vB).us[i]; \
    } \
}

#define VSUBUWM( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] - (vB).ul[i]; \
}

#define VSUBUWS( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) { \
        if ( (vA).ul[i] <= (vB).ul[i] ) (vT).ul[i] = 0; \
        else (vT).ul[i] = (vA).ul[i] - (vB).ul[i]; \
    } \
}

#define VSUMSWS( vT, vA, vB ) \
{ \
    ulong i; \
    double sum; \
    sum = (double)(vB).l[L_INDEX_MUNGE( 3 )]; \
    for ( i = 0; i < 4; i++ ) \
        sum += (double)(vA).l[i]; \
    if ( sum > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x7fffffff; \
    /* compare against the signed lower bound; the listing's */ \
    /* (double)(0x80000000) converts as an unsigned constant  */ \
    else if ( sum < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 3 )] = (long)sum; \
}

#define VSUM2SWS( vT, vA, vB ) \
{ \
    ulong i; \
    double sum1, sum2; \
    sum1 = (double)(vB).l[L_INDEX_MUNGE( 1 )]; \
    sum2 = (double)(vB).l[L_INDEX_MUNGE( 3 )]; \
    for ( i = 0; i < 2; i++ ) { \
        sum1 += (double)(vA).l[L_INDEX_MUNGE( i )]; \
        sum2 += (double)(vA).l[L_INDEX_MUNGE( i+2 )]; \
    } \
    /* lower bound written as a signed value; the listing's */ \
    /* (double)(0x80000000) converts as an unsigned constant */ \
    if ( sum1 > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 1 )] = 0x7fffffff; \
    else if ( sum1 < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 1 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 1 )] = (long)sum1; \
    if ( sum2 > (double)(0x7fffffff) ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x7fffffff; \
    else if ( sum2 < -2147483648.0 ) \
        (vT).l[L_INDEX_MUNGE( 3 )] = 0x80000000; \
    else \
        (vT).l[L_INDEX_MUNGE( 3 )] = (long)sum2; \
}

#define VSUM4SBS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).l[i]; \
        for ( j = 0; j < 4; j++ ) { \
            sum += (double)(vA).c[4*i + j]; \
            if ( sum > (double)(0x7fffffff) ) \
                (vT).l[i] = 0x7fffffff; \
            else if ( sum < -2147483648.0 )   /* signed lower bound */ \
                (vT).l[i] = 0x80000000; \
            else \
                (vT).l[i] = (long)sum; \
        } \
    } \
}

#define VSUM4SHS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).l[i]; \
        for ( j = 0; j < 2; j++ ) { \
            sum += (double)(vA).s[2*i + j]; \
            if ( sum > (double)(0x7fffffff) ) \
                (vT).l[i] = 0x7fffffff; \
            else if ( sum < -2147483648.0 )   /* signed lower bound */ \
                (vT).l[i] = 0x80000000; \
            else \
                (vT).l[i] = (long)sum; \
        } \
    } \
}

#define VSUM4UBS( vT, vA, vB ) \
{ \
    ulong i, j; \
    double sum; \
    for ( i = 0; i < 4; i++ ) { \
        sum = (double)(vB).ul[i]; \
        for ( j = 0; j < 4; j++ ) { \
            sum += (double)(vA).uc[4*i + j]; \
            /* limit expression is truncated in the listing; */ \
            /* 0xffffffff (2 * 0x7fffffff + 1) is assumed     */ \
            if ( sum > (2.0 * (double)(0x7fffffff) + 1.0) ) \
                (vT).ul[i] = 0xffffffff; \
            else \
                (vT).ul[i] = (ulong)sum; \
        } \
    } \
}

#define VUPKHSB_BE( vT, vB ) \
{ \
    long i; \
    for ( i = 7; i >= 0; i-- ) \
        (vT).s[i] = (short)(vB).c[i]; \
}

#define VUPKHSH_BE( vT, vB ) \
{ \
    long i; \
    for ( i = 3; i >= 0; i-- ) \
        (vT).l[i] = (long)(vB).s[i]; \
}

#define VUPKLSB_BE( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 8; i++ ) \
        (vT).s[i] = (short)(vB).c[i+8]; \
}

#define VUPKLSH_BE( vT, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).l[i] = (long)(vB).s[i+4]; \
}

#if defined( LITTLE_ENDIAN )
#define VUPKHSB( vT, vB )   VUPKLSB_BE( vT, vB );
#define VUPKHSH( vT, vB )   VUPKLSH_BE( vT, vB );
#define VUPKLSB( vT, vB )   VUPKHSB_BE( vT, vB );
#define VUPKLSH( vT, vB )   VUPKHSH_BE( vT, vB );
#else
#define VUPKHSB( vT, vB )   VUPKHSB_BE( vT, vB );
#define VUPKHSH( vT, vB )   VUPKHSH_BE( vT, vB );
#define VUPKLSB( vT, vB )   VUPKLSB_BE( vT, vB );
#define VUPKLSH( vT, vB )   VUPKLSH_BE( vT, vB );
#endif

#define VXOR( vT, vA, vB ) \
{ \
    ulong i; \
    for ( i = 0; i < 4; i++ ) \
        (vT).ul[i] = (vA).ul[i] ^ (vB).ul[i]; \
}

#endif /* end BUILD_MAX */

#define VRSAVE_COND 7 /* recommended VR condition bit */

/*
 * macros to save and restore the CR register
 */
#define SAVE_CR
#define REST_CR

/*
 * macros to save and restore the LR register
 */
#define SAVE_LR
#define REST_LR

/*
 * GET_FPR_SAVE_AREA places the start of the FPR save area into a register
 * GET_GPR_SAVE_AREA places the start of the GPR save area into a register
 * For MAX only:
 * GET_VR_SAVE_AREA places the start of the VR save area into a register
 *
 * Each macro rounds the buffer address up to a 16-byte boundary.
 */
#define GET_GPR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)gpr_save_area + 15) & ~15);

#define GET_FPR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)fpr_save_area + 15) & ~15);

#if defined( BUILD_MAX )
#define GET_VR_SAVE_AREA( ptr ) \
    ptr = (long)(((ulong)vr_save_area + 15) & ~15);
#endif
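The save-area macros round a buffer address up to a 16-byte boundary by adding 15 and then clearing the low four bits. A minimal sketch of that idiom follows. The helper name `align_up_16` is hypothetical, and `uintptr_t` is used here in place of the listing's `ulong` so the sketch stays correct on 64-bit hosts.

```c
#include <stdint.h>

/* Hypothetical helper: round addr up to the next multiple of 16.
 * Adding 15 carries into bit 4 unless addr is already aligned;
 * the mask then clears bits 0..3. */
static uintptr_t align_up_16(uintptr_t addr)
{
    return (addr + 15u) & ~(uintptr_t)15u;
}
```

The mask must have exactly the low four bits clear (`~15`, equivalently `-16`); any other constant leaves some of those bits set and the result is no longer guaranteed to be aligned.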

/*
 * macros to allocate and free space on the user stack.
 * For C implementation, the size is limited to 4096 bytes.
 */
#define PUSH_STACK( nbytes ) \
    sp = (long)(((ulong)stack + 15) & ~15);

#define POP_STACK( nbytes ) \
    sp = 0;

#define ALLOCATE_STACK_SPACE( ptr, nbytes ) \
    PUSH_STACK( nbytes ) \
    ptr = sp;

#define FREE_STACK_SPACE( nbytes ) POP_STACK( nbytes )

#define CREATE_STACK_FRAME( nbytes ) \
    PUSH_STACK( nbytes )

#define CREATE_STACK_FRAME_X( nbytes ) \
    CREATE_STACK_FRAME( nbytes )

#define DESTROY_STACK_FRAME \
    sp = 0;

#define CREATE_STACK_BUFFER( bufferp, byte_align, nbytes ) \
    ALLOCATE_STACK_SPACE( bufferp, nbytes )

#define CREATE_STACK_BUFFER_X( bufferp, byte_align, nbytes ) \
    CREATE_STACK_BUFFER( bufferp, byte_align, nbytes )

#define DESTROY_STACK_BUFFER \



/*
 * macros to create salcache from the stack, used in ucode only
 */
#define CREATE_STACK_SALCACHE \
    char localcachebuffer[SALCACHE_ALLOC_SIZE];

#define DESTROY_STACK_SALCACHE

/*
 * macros for saving and restoring non-volatile
 * floating point registers (FPRs)
 */

/*
 * macros for saving and restoring non-volatile
 * general purpose registers (GPRs)
 */
#define SAVE_r15_r29
#define SAVE_r15_r30
#define SAVE_r15_r31
#define REST_r15
#define REST_r15_r16
#define REST_r15_r17
#define REST_r15_r18
#define REST_r15_r19
#define REST_r15_r20
#define REST_r15_r21
#define REST_r15_r22
#define REST_r15_r23
#define REST_r15_r24
#define REST_r15_r25
#define REST_r15_r26
#define REST_r15_r27
#define REST_r15_r28
#define REST_r15_r29
#define REST_r15_r30
#define REST_r15_r31
#define SAVE_r16
#define SAVE_r16_r17
#define SAVE_r16_r18
#define SAVE_r16_r19
#define SAVE_r16_r20
#define SAVE_r16_r21
#define SAVE_r16_r22
#define SAVE_r16_r23
#define SAVE_r16_r24
#define SAVE_r16_r25
#define SAVE_r16_r26
#define SAVE_r16_r27
#define SAVE_r16_r28
#define SAVE_r16_r29
#define SAVE_r16_r30
#define SAVE_r16_r31
#define REST_r16
#define REST_r16_r17
#define REST_r16_r18
#define REST_r16_r19
#define REST_r16_r20
#define REST_r16_r21
#define REST_r16_r22
#define REST_r16_r23
#define REST_r16_r24
#define REST_r16_r25
#define REST_r16_r26
#define REST_r16_r27
#define REST_r16_r28
#define REST_r16_r29
#define REST_r16_r30
#define REST_r16_r31


/*
 * VMX registers
 */
#define USE_THRU_v0( cond )
#define USE_THRU_v1( cond )
#define USE_THRU_v2( cond )
#define USE_THRU_v3( cond )
#define USE_THRU_v4( cond )
#define USE_THRU_v5( cond )
#define USE_THRU_v6( cond )
#define USE_THRU_v7( cond )
#define USE_THRU_v8( cond )
#define USE_THRU_v9( cond )
#define USE_THRU_v10( cond )
#define USE_THRU_v11( cond )
#define USE_THRU_v12( cond )
#define USE_THRU_v13( cond )
#define USE_THRU_v14( cond )
#define USE_THRU_v15( cond )
#define USE_THRU_v16( cond )
#define USE_THRU_v17( cond )
#define USE_THRU_v18( cond )
#define USE_THRU_v19( cond )
#define USE_THRU_v20( cond )
#define USE_THRU_v21( cond )
#define USE_THRU_v22( cond )
#define USE_THRU_v23( cond )
#define USE_THRU_v24( cond )
#define USE_THRU_v25( cond )
#define USE_THRU_v26( cond )
#define USE_THRU_v27( cond )
#define USE_THRU_v28( cond )
#define USE_THRU_v29( cond )
#define USE_THRU_v30( cond )
#define USE_THRU_v31( cond )
#define FREE_THRU_v0( cond )
#define FREE_THRU_v1( cond )
#define FREE_THRU_v2( cond )
#define FREE_THRU_v3( cond )
#define FREE_THRU_v4( cond )
#define FREE_THRU_v5( cond )
#define FREE_THRU_v6( cond )
#define FREE_THRU_v7( cond )
#define FREE_THRU_v8( cond )
#define FREE_THRU_v9( cond )
#define FREE_THRU_v10( cond )
#define FREE_THRU_v11( cond )
#define FREE_THRU_v12( cond )
#define FREE_THRU_v13( cond )
#define FREE_THRU_v14( cond )
#define FREE_THRU_v15( cond )
#define FREE_THRU_v16( cond )
#define FREE_THRU_v17( cond )
#define FREE_THRU_v18( cond )
#define FREE_THRU_v19( cond )
#define FREE_THRU_v20( cond )
#define FREE_THRU_v21( cond )
#define FREE_THRU_v22( cond )
#define FREE_THRU_v23( cond )
#define FREE_THRU_v24( cond )
#define FREE_THRU_v25( cond )
#define FREE_THRU_v26( cond )
#define FREE_THRU_v27( cond )
#define FREE_THRU_v28( cond )
#define FREE_THRU_v29( cond )
#define FREE_THRU_v30( cond )
#define FREE_THRU_v31( cond )
#endif

#endif /* end SALPPC_H */



END OF FILE salppc.h

#if !defined( SALPPC_INC )
#define SALPPC_INC



MC Standard Algorithms
PPC Version

File Name:
Description:

Source files should have extension .mac (for example, vadd.mac)
and must include this file (salppc.inc).

To assemble for PPC ucode, use the following basic
makefile build rule:

.SUFFIXES: .mac .c .s .o

.mac.o:
        cp $*.mac $*.c
        ccmc -o $*.s -E $*.c
        ccmc -c -o $*.o $*.s
        rm -f $*.s
        rm -f $*.c

To compile for C, use the following basic makefile build rule:

.SUFFIXES: .mac .c .o

.mac.o:
        cp $*.mac $*.c
        ccmc -DCOMPILE_C -c -o $*.o $*.c
        rm -f $*.c

The first 8 function arguments are passed in GPR registers
r3 - r10. Arguments beyond 8 are passed on the stack and may
be obtained with the GET_ARG8, GET_ARG9, ... GET_ARG15 macros.
Additional GPR registers should be assigned in ascending order
starting from the last function argument. These may be declared
with the DECLARE_rx[_ry] macros. For example, a function with
5 arguments that requires 3 additional GPR registers would
issue: DECLARE_r8_r10. r0, if required, should be declared
separately with the DECLARE_r0 macro. GPR registers above r12
must be saved and restored using the SAVE_r13[_ry] and
REST_r13[_ry] macros, respectively.

FPR registers should be assigned in ascending order starting
with f0[d0]. These may be declared with the DECLARE_f0[_fy]
or DECLARE_d0[_dy] macros, for example, DECLARE_f0_f11.
FPR registers above f13[d13] must be saved and restored using
the SAVE_f14[_fy] and REST_f14[_fy] or SAVE_d14[_dy] and
REST_d14[_dy] macros, respectively.

All variables must be assigned a register using the
pre-processor #define directive. GPR registers are named
r0 - r31. Single precision FPR registers are named f0 - f31.
Double precision FPR registers are named d0 - d31. Different
variables may be assigned to the same register.

Functions must begin with the FUNC_PROLOG macro and end
with the FUNC_EPILOG macro.
Macros are provided for both Fortran and C entry points.

The GET_SAL_CACHE macro should be used to get the address of
the "current" salcache buffer into a GPR register.


* 
* 


* 
* 


Avoid terminating macro lines with a semicolon.

The following example demonstrates typical usage:

#include "salppc.inc"

/*
 * assign variables to registers
 */
#define A      r3
#define I      r4
#define B      r5
#define J      r6
#define C      r7
#define K      r8
#define D      r9
#define L      r10
#define N      r12
#define EFLAG  r11
#define count  r11
#define t0     r13
#define t1     r13
#define t2     r14
#define t3     r14
#define t4     r15
#define t5     r15
#define t6     r16
#define a0     f0
#define a1     f1
#define a2     f2
#define a3     f3
#define b0     f4
#define b1     f5
#define b2     f6
#define b3     f7
#define c0     f8
#define c1     f9
#define c2     f10
#define c3     f11
#define d0     f12
#define d1     f13
#define d2     f14
#define d3     f15

FUNC_PROLOG                     /* must precede function */

#if !defined( COMPILE_C )
U_ENTRY(foo_)
    FORTRAN_DREF_4(I, J, K, L)
    FORTRAN_DREF_ARG8
U_ENTRY(foo)
    LI(EFLAG, 0)
    BR(common)

U_ENTRY(foo_x_)
    FORTRAN_DREF_4(I, J, K, L)
    FORTRAN_DREF_ARG8
    FORTRAN_DREF_ARG9
#endif

ENTRY_10(foo_x, A, I, B, J, C, K, D, L, N, EFLAG)
    DECLARE_r13_r16
    DECLARE_f0_f15
    GET_ARG9( EFLAG )           /* get the 9'th arg (EFLAG) off stack */

LABEL(common)
    SAVE_CR                     /* needed if using fields 2, 3 or 4 */
    SAVE_r13_r16
    SAVE_f14_f15
    SAVE_LR                     /* needed if making a function call */
    GET_ARG8( N )               /* get the 8'th arg (N) off stack */

    body of function . . .

    REST_CR
    REST_r13_r16
    REST_f14_f15
    REST_LR
    RETURN

FUNC_EPILOG                     /* must conclude function */



Mercury Computer Systems, Inc. 
Copyright (c) 1996 All rights reserved 



Revision:  0.2  0.3  0.4
Date:      970521  980813

Engineer; Reason

jg;  Created
jfk; Added POSTING_BUFFER_COUNT and made
     TEST_IF_DCBZ macro time "stw" instead
     of doing the TEST_IF_DCBT macro (lwz)
jfk; Added SALCACHE_ALLOC_SIZE,
     ALIGN_SALCACHE, CREATE_SALCACHE_FRAME,
     DESTROY_SALCACHE_FRAME
jfk; Added SET_DCB[TZ]_COND macros.
     Made old macros not assemble
jfk; Changes SALCACHE_ALLOC_SIZE for 750

#endif /* header */



#if !defined( BUILD_603 ) && !defined( BUILD_750 ) && !defined( BUILD_MAX )
#error You must define BUILD_603 or BUILD_750 or BUILD_MAX
#endif



/*
 * define single precision floating point field sizes,
 * limits, and values
 */
#define F_FLOAT_SIZE    32
#define F_FRAC_SIZE     23
#define F_HIDDEN_SIZE   1
#define F_EXP_SIZE      8
#define F_SIGN_SIZE     1

#define F_SIGN_BIT      (F_FLOAT_SIZE - F_SIGN_SIZE)
#define F_EXP_MASK      ((1 << F_EXP_SIZE) - 1)
#define F_EXP_BIAS      ((1 << (F_EXP_SIZE-1)) - 1)
#define F_MAX_EXP       F_EXP_BIAS
#define F_MIN_EXP       (-(F_EXP_BIAS-1))
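The field-size constants describe the IEEE-754 single precision layout: 23 fraction bits, 8 exponent bits and 1 sign bit. As a rough illustration of how such constants are typically used, the sketch below extracts the unbiased exponent of a float. The function name `float_exponent` is hypothetical, the four defines are repeated only so the sketch stands alone, and a 32-bit IEEE-754 `float` is assumed.

```c
#include <stdint.h>
#include <string.h>

#define F_FRAC_SIZE 23
#define F_EXP_SIZE  8
#define F_EXP_MASK  ((1 << F_EXP_SIZE) - 1)          /* 0xff */
#define F_EXP_BIAS  ((1 << (F_EXP_SIZE - 1)) - 1)    /* 127  */

/* Hypothetical helper: unbiased exponent of a normalized float.
 * Assumes float is a 32-bit IEEE-754 single precision value. */
static int float_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* reinterpret the bit pattern */
    return (int)((bits >> F_FRAC_SIZE) & F_EXP_MASK) - F_EXP_BIAS;
}
```

For 8.0f, whose value is 1.0 x 2^3, the stored exponent field is 130 and the function returns 3.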

/*
 * define double precision floating point field sizes,
 * limits, and values
 */
#define D_FLOAT_SIZE    64
#define D_FRAC_SIZE     52
#define D_HIDDEN_SIZE   1
#define D_EXP_SIZE      11
#define D_SIGN_SIZE     1

#define D_SIGN_BIT      (D_FLOAT_SIZE - D_SIGN_SIZE)
#define D_EXP_MASK      ((1 << D_EXP_SIZE) - 1)
#define D_EXP_BIAS      ((1 << (D_EXP_SIZE-1)) - 1)
#define D_MAX_EXP       D_EXP_BIAS
#define D_MIN_EXP       (-(D_EXP_BIAS-1))



#if defined( BUILD_603 )
#define LOG2_CACHE_SIZE (14) /* Log (base 2) of 603 data cache */
#elif defined( BUILD_750 ) || defined( BUILD_MAX )
#define LOG2_CACHE_SIZE (15) /* Log (base 2) of 750 or MAX data cache */
#endif

#define LOG2_CACHE_BSIZE        (LOG2_CACHE_SIZE)
#define LOG2_CACHE_HSIZE        (LOG2_CACHE_SIZE - 1)
#define LOG2_CACHE_LSIZE        (LOG2_CACHE_SIZE - 2)
#define LOG2_CACHE_FSIZE        (LOG2_CACHE_SIZE - 2)
#define LOG2_CACHE_DSIZE        (LOG2_CACHE_SIZE - 3)
#define LOG2_CACHE_CSIZE        (LOG2_CACHE_SIZE - 3)
#define LOG2_CACHE_ZSIZE        (LOG2_CACHE_SIZE - 4)

#define CACHE_SIZE              (1 << LOG2_CACHE_SIZE)
#define CACHE_BSIZE             (CACHE_SIZE)
#define CACHE_HSIZE             (CACHE_SIZE >> 1)
#define CACHE_LSIZE             (CACHE_SIZE >> 2)
#define CACHE_FSIZE             (CACHE_SIZE >> 2)
#define CACHE_DSIZE             (CACHE_SIZE >> 3)
#define CACHE_CSIZE             (CACHE_SIZE >> 3)
#define CACHE_ZSIZE             (CACHE_SIZE >> 4)

#define LOG2_CACHE_LINE_SIZE    5
#define CACHE_LINE_SIZE         (1 << LOG2_CACHE_LINE_SIZE)
#define CACHE_LINE_LSIZE        (CACHE_LINE_SIZE >> 2)
#define CACHE_LINE_MASK         (CACHE_LINE_SIZE - 1)
#define CACHE_LINE_ADDR_MASK    (0xffffffe0)

#define LOG2_SALCACHE_ALIGN     6
#define SALCACHE_ALIGN          (1 << LOG2_SALCACHE_ALIGN)
#define SALCACHE_ALIGN_MASK     (SALCACHE_ALIGN - 1)

#define SALCACHE_SIZE           CACHE_SIZE
#define SALCACHE_EXTRA_SIZE     (SALCACHE_ALIGN + 64)
#define SALCACHE_ALLOC_SIZE     (SALCACHE_SIZE + SALCACHE_EXTRA_SIZE)



/*
 * Define memory vector non-cache (N) / cache (C) FLAG values for
 * Enhanced SAL calls (final argument). The letters in the symbol
 * correspond to the vectors in the call, moving from left to right
 * so, for example:
 *
 * for VMULX, there are the following 8 possibilities:
 *
 * VMULX (A, I, B, J, C, K, N, SAL_NNN)   A, B, C all not in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NNC)   A, B not in cache, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NCN)   A, C not in cache, B in cache
 * VMULX (A, I, B, J, C, K, N, SAL_NCC)   A not in cache, B, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CNN)   B, C not in cache, A in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CNC)   B not in cache, A, C in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CCN)   C not in cache, A, B in cache
 * VMULX (A, I, B, J, C, K, N, SAL_CCC)   A, B, C all in cache
 */

/*
 * 1 vector algorithms
 */

#define SAL_N   0
#define SAL_C   1

/*
 * 2 vector algorithms
 */
#define SAL_NN  0
#define SAL_NC  1
#define SAL_CN  2
#define SAL_CC  3

/*
 * 3 vector algorithms
 */
#define SAL_NNN 0
#define SAL_NNC 1
#define SAL_NCN 2
#define SAL_NCC 3
#define SAL_CNN 4
#define SAL_CNC 5
#define SAL_CCN 6
#define SAL_CCC 7


/*
 * 4 vector algorithms
 */
#define SAL_NNNN 0
#define SAL_NNNC 1
#define SAL_NNCN 2
#define SAL_NNCC 3
#define SAL_NCNN 4
#define SAL_NCNC 5
#define SAL_NCCN 6
#define SAL_NCCC 7
#define SAL_CNNN 8
#define SAL_CNNC 9
#define SAL_CNCN 10
#define SAL_CNCC 11
#define SAL_CCNN 12
#define SAL_CCNC 13
#define SAL_CCCN 14
#define SAL_CCCC 15


/*
 * 5 vector algorithms
 */
#define SAL_NNNNN 0
#define SAL_NNNNC 1
#define SAL_NNNCN 2
#define SAL_NNNCC 3
#define SAL_NNCNN 4
#define SAL_NNCNC 5
#define SAL_NNCCN 6
#define SAL_NNCCC 7
#define SAL_NCNNN 8
#define SAL_NCNNC 9
#define SAL_NCNCN 10
#define SAL_NCNCC 11
#define SAL_NCCNN 12
#define SAL_NCCNC 13
#define SAL_NCCCN 14
#define SAL_NCCCC 15
#define SAL_CNNNN 16
#define SAL_CNNNC 17
#define SAL_CNNCN 18
#define SAL_CNNCC 19
#define SAL_CNCNN 20
#define SAL_CNCNC 21
#define SAL_CNCCN 22
#define SAL_CNCCC 23
#define SAL_CCNNN 24
#define SAL_CCNNC 25
#define SAL_CCNCN 26
#define SAL_CCNCC 27
#define SAL_CCCNN 28
#define SAL_CCCNC 29
#define SAL_CCCCN 30
#define SAL_CCCCC 31
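The flag values follow a simple binary pattern: reading the letters from left to right, each C contributes a 1 bit and each N a 0 bit, with the leftmost vector in the most significant position. The sketch below computes a flag from its letter string to make the pattern explicit. The function name `sal_flag_from_letters` is hypothetical and is not part of the listing.

```c
#include <stddef.h>

/* Hypothetical helper: build a cache flag from a string such as "NCC".
 * Each 'C' (vector in cache) sets a bit; any other letter leaves it clear.
 * The leftmost letter ends up in the most significant bit. */
static int sal_flag_from_letters(const char *letters)
{
    int flag = 0;
    size_t i;

    for (i = 0; letters[i] != '\0'; i++)
        flag = (flag << 1) | (letters[i] == 'C');
    return flag;
}
```

For instance, "CNCN" maps to binary 1010, which is 10 and matches SAL_CNCN in the table.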


/*
 * define FFT_setup_ppc603e byte offsets
 */
#define SETUP_HANDLE
#define SETUP_SMALL_TWIDP
#define SETUP_SMALL_BITR_TWIDP
#define SETUP_SMALL_LOG2M
#define SETUP_BIG_TWIDP
#define SETUP_BIG_XY_TWIDP
#define SETUP_BIG_LOG2MXY
#define SETUP_BIG_LOG2X
#define SETUP_BIG_LOG2Y
#define SETUP_BIG_STRIPX
#define SETUP_RPASS_TWIDP
#define SETUP_RADIX3_TWIDP
#define SETUP_RADIX5_TWIDP
#define SETUP_LOG2M
#define SETUP_LOG2MR
#define SETUP_VMX_BITR_TWIDP
#define SETUP_VMX_TABLES



#define PREFETCH_CONTROL (0xFBFFFE00)
#define PREFETCH_CONTROL_H -1024 /* (0xFBFF + 1) */
#define PREFETCH_CONTROL_L -512 /* (0xFE00) */

#define MISCON_B (0xFBFFFC18)
#define MISCON_B_H -1024 /* (0xFBFF + 1) */
#define MISCON_B_L -1000 /* (0xFC18) */

#define PREFETCH_DISABLED
#define PREFETCH_AUTO_6
#define PREFETCH_AUTO_5
#define PREFETCH_AUTO_4
#define PREFETCH_AUTO_3
#define PREFETCH_AUTO_2
#define PREFETCH_AUTO_1
#define PREFETCH_AUTO_0

#define PREFETCH_MANUAL_0
#define PREFETCH_MANUAL_2
#define PREFETCH_MANUAL_4
#define PREFETCH_MANUAL_6
#define PREFETCH_MANUAL_8
#define PREFETCH_MANUAL_10
#define PREFETCH_MANUAL_12
#define PREFETCH_MANUAL_14

#define USE_PREFETCH_CONTROL 16
#define USE_MISCON_B 0



#define PREFETCH_MASK 15

#define PREFETCH_DEFAULT (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_OFF (USE_PREFETCH_CONTROL | PREFETCH_DISABLED)
#define PREFETCH_A6 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_6)
#define PREFETCH_A5 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_5)
#define PREFETCH_A4 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_4)
#define PREFETCH_A3 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_3)
#define PREFETCH_A2 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_2)
#define PREFETCH_A1 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_1)
#define PREFETCH_A0 (USE_PREFETCH_CONTROL | PREFETCH_AUTO_0)

#define PREFETCH_M0 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_0)
#define PREFETCH_M2 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_2)
#define PREFETCH_M4 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_4)
#define PREFETCH_M6 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_6)
#define PREFETCH_M8 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_8)
#define PREFETCH_M10 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_10)
#define PREFETCH_M12 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_12)
#define PREFETCH_M14 (USE_PREFETCH_CONTROL | PREFETCH_MANUAL_14)

/*
 * macro to compile for PPC assembly (COMPILE_C *not* defined) or
 * C code (COMPILE_C defined)
 */

#if defined( COMPILE_C )
#include "salppc.h"

/*
 * GPR register equates
 */
#define r0 0
#define sp 1
#define rtoc 2
#define r3 3
#define r4 4
#define r5 5
#define r6 6
#define r7 7
#define r8 8
#define r9 9
#define r10 10
#define r11 11
#define r12 12
#define r13 13
#define r14 14
#define r15 15
#define r16 16
#define r17 17
#define r18 18
#define r19 19
#define r20 20
#define r21 21
#define r22 22
#define r23 23
#define r24 24
#define r25 25
#define r26 26
#define r27 27
#define r28 28
#define r29 29
#define r30 30
#define r31 31


/*
 * FPR single precision register equates
 */
#define f0 0
#define f1 1
#define f2 2
#define f3 3
#define f4 4
#define f5 5
#define f6 6
#define f7 7
#define f8 8
#define f9 9
#define f10 10
#define f11 11
#define f12 12
#define f13 13
#define f14 14
#define f15 15
#define f16 16
#define f17 17
#define f18 18
#define f19 19
#define f20 20
#define f21 21
#define f22 22
#define f23 23
#define f24 24
#define f25 25
#define f26 26
#define f27 27
#define f28 28
#define f29 29
#define f30 30
#define f31 31



/*
 * FPR double precision register equates
 */
#define d0 0
#define d1 1
#define d2 2
#define d3 3
#define d4 4
#define d5 5
#define d6 6
#define d7 7
#define d8 8
#define d9 9
#define d10 10
#define d11 11
#define d12 12
#define d13 13
#define d14 14
#define d15 15
#define d16 16
#define d17 17
#define d18 18
#define d19 19
#define d20 20
#define d21 21
#define d22 22
#define d23 23
#define d24 24
#define d25 25
#define d26 26
#define d27 27
#define d28 28
#define d29 29
#define d30 30
#define d31 31




#if defined( BUILD_MAX )
/*
 * VMX (g4) register equates
 */
#define v0 0
#define v1 1
#define v2 2
#define v3 3
#define v4 4
#define v5 5
#define v6 6
#define v7 7
#define v8 8
#define v9 9
#define v10 10
#define v11 11
#define v12 12
#define v13 13
#define v14 14
#define v15 15
#define v16 16
#define v17 17
#define v18 18
#define v19 19
#define v20 20
#define v21 21
#define v22 22
#define v23 23
#define v24 24
#define v25 25
#define v26 26
#define v27 27
#define v28 28
#define v29 29
#define v30 30
#define v31 31



#endif 

#define FUNC_PROLOG \
    .section .text; \
    .align 5;

#define FUNC_EPILOG

#define TEXT_SECTION( logb2_align ) \
    .section .text; \
    .align logb2_align;

#define DATA_SECTION( logb2_align ) \
    .section .data; \
    .align logb2_align;

#define RODATA_SECTION( logb2_align ) \
    .section .rodata; \
    .align logb2_align;

#define PC_OFFSET( nbytes ) (. + (nbytes))

/*
 * make a "double" concat to fool the preprocessor so that input
 * arguments get translated before concatenation; otherwise, the
 * concatenated symbol doesn't get translated properly
 */
#define CONCAT( left, right ) CONCAT_NEST( left, right )
#define CONCAT_NEST( left, right ) left##right
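
The effect of the two-level CONCAT can be checked in plain C. The following is only a minimal sketch: `ROOT`, `CONCAT_DIRECT`, `STR`, `STR_NEST` and the two helper functions are hypothetical names introduced for illustration and are not part of the listing.

```c
#include <assert.h>
#include <string.h>

/* Same two-level pattern as the listing. */
#define CONCAT( left, right )      CONCAT_NEST( left, right )
#define CONCAT_NEST( left, right ) left##right

/* One-level paste, for contrast (hypothetical, not in the listing). */
#define CONCAT_DIRECT( left, right ) left##right

/* Stringify helpers (hypothetical) so the pasted token can be inspected. */
#define STR_NEST( x ) #x
#define STR( x )      STR_NEST( x )

/* ROOT stands in for a macro argument that must expand before pasting. */
#define ROOT vadd

/* Token produced by the two-level macro: ROOT is expanded first. */
static const char *concat_expanded( void )
{
    return STR( CONCAT( ROOT, _jump ) );
}

/* Token produced by direct pasting: ROOT is pasted unexpanded. */
static const char *concat_direct( void )
{
    return STR( CONCAT_DIRECT( ROOT, _jump ) );
}
```

With these definitions, `concat_expanded()` yields the string `vadd_jump`, while `concat_direct()` yields `ROOT_jump`, which is the failure the comment above describes.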

/*
 * declarations and definitions
 */

#define EXTERN_DATA( symbol )
#define EXTERN_FUNC( func )

/*
 * macro for a global declaration
 */
#define GLOBAL( symbol ) \
    .globl symbol

/*
 * macro for a local declaration
 */
#define LOCAL( symbol )

/*
 * macros for creating static arrays
 */
#define START_ARRAY( name ) \
name##:

#define START_C_ARRAY( name ) START_ARRAY( name )
#define START_UC_ARRAY( name ) START_ARRAY( name )
#define START_S_ARRAY( name ) START_ARRAY( name )
#define START_US_ARRAY( name ) START_ARRAY( name )
#define START_L_ARRAY( name ) START_ARRAY( name )
#define START_UL_ARRAY( name ) START_ARRAY( name )
#define START_F_ARRAY( name ) START_ARRAY( name )

#define END_ARRAY

#define DATA( type, d1 ) \
    .##type d1

#define DATA2( type, d1, d2 ) \
    .##type d1, d2

#define DATA4( type, d1, d2, d3, d4 ) \
    .##type d1, d2, d3, d4

#define DATA8( type, d1, d2, d3, d4, d5, d6, d7, d8 ) \
    .##type d1, d2, d3, d4, d5, d6, d7, d8

#define C_DATA( d1 ) DATA( byte, d1 )
#define UC_DATA( d1 ) DATA( byte, d1 )
#define S_DATA( d1 ) DATA( short, d1 )
#define US_DATA( d1 ) DATA( short, d1 )
#define L_DATA( d1 ) DATA( long, d1 )
#define UL_DATA( d1 ) DATA( long, d1 )
#define F_DATA( d1 ) DATA( float, d1 )

#if defined( LITTLE_ENDIAN )
#define D_DATA( d1, d2 ) DATA2( long, d2, d1 )
#else
#define D_DATA( d1, d2 ) DATA2( long, d1, d2 )
#endif

#define C_DATA2( d1, d2 ) DATA2( byte, d1, d2 )
#define UC_DATA2( d1, d2 ) DATA2( byte, d1, d2 )
#define S_DATA2( d1, d2 ) DATA2( short, d1, d2 )
#define US_DATA2( d1, d2 ) DATA2( short, d1, d2 )
#define L_DATA2( d1, d2 ) DATA2( long, d1, d2 )
#define UL_DATA2( d1, d2 ) DATA2( long, d1, d2 )
#define F_DATA2( d1, d2 ) DATA2( float, d1, d2 )

#define C_DATA4( d1, d2, d3, d4 ) DATA4( byte, d1, d2, d3, d4 )
#define UC_DATA4( d1, d2, d3, d4 ) DATA4( byte, d1, d2, d3, d4 )
#define S_DATA4( d1, d2, d3, d4 ) DATA4( short, d1, d2, d3, d4 )
#define US_DATA4( d1, d2, d3, d4 ) DATA4( short, d1, d2, d3, d4 )
#define L_DATA4( d1, d2, d3, d4 ) DATA4( long, d1, d2, d3, d4 )
#define UL_DATA4( d1, d2, d3, d4 ) DATA4( long, d1, d2, d3, d4 )
#define F_DATA4( d1, d2, d3, d4 ) DATA4( float, d1, d2, d3, d4 )

#define C_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( byte, d1, d2, d3, d4, d5, d6, d7, d8 )
#define UC_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( byte, d1, d2, d3, d4, d5, d6, d7, d8 )
#define S_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( short, d1, d2, d3, d4, d5, d6, d7, d8 )
#define US_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( short, d1, d2, d3, d4, d5, d6, d7, d8 )
#define L_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( long, d1, d2, d3, d4, d5, d6, d7, d8 )
#define UL_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( long, d1, d2, d3, d4, d5, d6, d7, d8 )
#define F_DATA8( d1, d2, d3, d4, d5, d6, d7, d8 ) \
    DATA8( float, d1, d2, d3, d4, d5, d6, d7, d8 )



/*
 * macros for creating vmx permute masks (128-bits)
 */
#if defined( LITTLE_ENDIAN )

#define L_PERMUTE_MUNGE( l ) ( (l) ^ 0x1c1c1c1c )
#define S_PERMUTE_MUNGE( s ) ( (s) ^ 0x1e1e )
#define C_PERMUTE_MUNGE( c ) ( (c) ^ 0x1f )

#define L_INDEX_MUNGE( x ) ( (x) ^ 0x3 )
#define S_INDEX_MUNGE( x ) ( (x) ^ 0x7 )
#define C_INDEX_MUNGE( x ) ( (x) ^ 0xf )

#else

#define L_PERMUTE_MUNGE( l ) ( l )
#define S_PERMUTE_MUNGE( s ) ( s )
#define C_PERMUTE_MUNGE( c ) ( c )

#define L_INDEX_MUNGE( x ) ( x )
#define S_INDEX_MUNGE( x ) ( x )
#define C_INDEX_MUNGE( x ) ( x )

#endif

#define L_PERMUTE_MASK( l1, l2, l3, l4 ) \
    .long L_PERMUTE_MUNGE( l1 ), L_PERMUTE_MUNGE( l2 ), \
          L_PERMUTE_MUNGE( l3 ), L_PERMUTE_MUNGE( l4 )

#define S_PERMUTE_MASK( s1, s2, s3, s4, s5, s6, s7, s8 ) \
    .short S_PERMUTE_MUNGE( s1 ), S_PERMUTE_MUNGE( s2 ), \
           S_PERMUTE_MUNGE( s3 ), S_PERMUTE_MUNGE( s4 ), \
           S_PERMUTE_MUNGE( s5 ), S_PERMUTE_MUNGE( s6 ), \
           S_PERMUTE_MUNGE( s7 ), S_PERMUTE_MUNGE( s8 )

#define C_PERMUTE_MASK( c1, c2, c3, c4, c5, c6, c7, c8, \
                        c9, c10, c11, c12, c13, c14, c15, c16 ) \
    .byte C_PERMUTE_MUNGE( c1 ), C_PERMUTE_MUNGE( c2 ), \
          C_PERMUTE_MUNGE( c3 ), C_PERMUTE_MUNGE( c4 ), \
          C_PERMUTE_MUNGE( c5 ), C_PERMUTE_MUNGE( c6 ), \
          C_PERMUTE_MUNGE( c7 ), C_PERMUTE_MUNGE( c8 ), \
          C_PERMUTE_MUNGE( c9 ), C_PERMUTE_MUNGE( c10 ), \
          C_PERMUTE_MUNGE( c11 ), C_PERMUTE_MUNGE( c12 ), \
          C_PERMUTE_MUNGE( c13 ), C_PERMUTE_MUNGE( c14 ), \
          C_PERMUTE_MUNGE( c15 ), C_PERMUTE_MUNGE( c16 )
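
The little-endian byte munges can be illustrated in C. This is a sketch under the assumption that the damaged operator in the listing is XOR (`^`); the wrapper functions `c_permute_munge` and `c_index_munge` are hypothetical names added here. For a 5-bit permute selector, XOR with 0x1f is the same as `31 - c`, and for a 4-bit byte index, XOR with 0xf is the same as `15 - x`, so each munge reverses the byte order within its range and undoes itself when applied twice.

```c
#include <assert.h>

/* Little-endian forms, assuming the garbled operator is XOR. */
#define C_PERMUTE_MUNGE( c ) ( (c) ^ 0x1f )
#define C_INDEX_MUNGE( x )   ( (x) ^ 0xf )

/* Munge one byte selector of a permute mask (selector range 0..31). */
static int c_permute_munge( int c )
{
    assert( c >= 0 && c < 32 );
    return C_PERMUTE_MUNGE( c );
}

/* Munge one byte index within a 16-byte vector (index range 0..15). */
static int c_index_munge( int x )
{
    assert( x >= 0 && x < 16 );
    return C_INDEX_MUNGE( x );
}
```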



/*
 * macro for a microcode entry point (e.g. vaddx, vaddx_)
 * U_ENTRY is a "nop" for C code
 */
#define U_ENTRY( func_name ) \
    .globl func_name; \
func_name:

/*
 * macros for C function prototypes
 */

#define C_PROTOTYPE_0( func_name )
#define C_PROTOTYPE_1( func_name )
#define C_PROTOTYPE_2( func_name )
#define C_PROTOTYPE_3( func_name )
#define C_PROTOTYPE_4( func_name )
#define C_PROTOTYPE_5( func_name )
#define C_PROTOTYPE_6( func_name )
#define C_PROTOTYPE_7( func_name )
#define C_PROTOTYPE_8( func_name )
#define C_PROTOTYPE_9( func_name )
#define C_PROTOTYPE_10( func_name )
#define C_PROTOTYPE_11( func_name )
#define C_PROTOTYPE_12( func_name )
#define C_PROTOTYPE_13( func_name )
#define C_PROTOTYPE_14( func_name )
#define C_PROTOTYPE_15( func_name )
#define C_PROTOTYPE_16( func_name )

/*
 * macros for C and Fortran callable entry points
 */
#define ENTRY_0( func_name ) \
    .globl func_name; \
func_name:

#define ENTRY_1( func_name, arg0 ) \
    .globl func_name; \
func_name:

#define ENTRY_2( func_name, arg0, arg1 ) \
    .globl func_name; \
func_name:

#define ENTRY_3( func_name, arg0, arg1, arg2 ) \
    .globl func_name; \
func_name:

#define ENTRY_4( func_name, arg0, arg1, arg2, arg3 ) \
    .globl func_name; \
func_name:

#define ENTRY_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    .globl func_name; \
func_name:

#define ENTRY_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    .globl func_name; \
func_name:

#define ENTRY_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6 ) \
    .globl func_name; \
func_name:

#define ENTRY_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7 ) \
    .globl func_name; \
func_name:

#define ENTRY_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                 arg6, arg7, arg8 ) \
    .globl func_name; \
func_name:

#define ENTRY_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9 ) \
    .globl func_name; \
func_name:

#define ENTRY_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10 ) \
    .globl func_name; \
func_name:

#define ENTRY_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11 ) \
    .globl func_name; \
func_name:

#define ENTRY_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12 ) \
    .globl func_name; \
func_name:

#define ENTRY_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13 ) \
    .globl func_name; \
func_name:

#define ENTRY_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14 ) \
    .globl func_name; \
func_name:

#define ENTRY_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, \
                  arg6, arg7, arg8, arg9, arg10, arg11, \
                  arg12, arg13, arg14, arg15 ) \
    .globl func_name; \
func_name:

/*
 * macros to de-reference any set of the first 8 arguments
 * passed by reference to the Fortran entry point but by
 * value to the corresponding C entry point
 */
#define FORTRAN_DREF_1( arg0 ) \
    lwz arg0, 0(arg0);

#define FORTRAN_DREF_2( arg0, arg1 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1);

#define FORTRAN_DREF_3( arg0, arg1, arg2 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2);

#define FORTRAN_DREF_4( arg0, arg1, arg2, arg3 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3);

#define FORTRAN_DREF_5( arg0, arg1, arg2, arg3, arg4 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4);

#define FORTRAN_DREF_6( arg0, arg1, arg2, arg3, arg4, arg5 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5);

#define FORTRAN_DREF_7( arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5); \
    lwz arg6, 0(arg6);

#define FORTRAN_DREF_8( arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    lwz arg0, 0(arg0); \
    lwz arg1, 0(arg1); \
    lwz arg2, 0(arg2); \
    lwz arg3, 0(arg3); \
    lwz arg4, 0(arg4); \
    lwz arg5, 0(arg5); \
    lwz arg6, 0(arg6); \
    lwz arg7, 0(arg7);

/*
 * macros to de-reference specific arguments beyond the first 8
 * passed by value to the C entry point
 */

#define ARG_OFF (8 - 8*4)

#define FORTRAN_DREF_ARG8 \
    lwz r12, (ARG_OFF + 8*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 8*4)(sp);

#define FORTRAN_DREF_ARG9 \
    lwz r12, (ARG_OFF + 9*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 9*4)(sp);

#define FORTRAN_DREF_ARG10 \
    lwz r12, (ARG_OFF + 10*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 10*4)(sp);

#define FORTRAN_DREF_ARG11 \
    lwz r12, (ARG_OFF + 11*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 11*4)(sp);

#define FORTRAN_DREF_ARG12 \
    lwz r12, (ARG_OFF + 12*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 12*4)(sp);

#define FORTRAN_DREF_ARG13 \
    lwz r12, (ARG_OFF + 13*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 13*4)(sp);

#define FORTRAN_DREF_ARG14 \
    lwz r12, (ARG_OFF + 14*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 14*4)(sp);

#define FORTRAN_DREF_ARG15 \
    lwz r12, (ARG_OFF + 15*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 15*4)(sp);

#define FORTRAN_DREF_ARG16 \
    lwz r12, (ARG_OFF + 16*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 16*4)(sp);

#define FORTRAN_DREF_ARG17 \
    lwz r12, (ARG_OFF + 17*4)(sp); \
    lwz r12, 0(r12); \
    stw r12, (ARG_OFF + 17*4)(sp);
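
The stack displacements used by the macros above follow directly from ARG_OFF. As a small worked example in C (the helper `arg_stack_offset` is a hypothetical name, not part of the listing), argument index 8 resolves to displacement 8 from `sp`, and each later argument sits 4 bytes further on.

```c
#include <assert.h>

/* Same definition as the listing: (8 - 32) = -24. */
#define ARG_OFF (8 - 8*4)

/* Byte displacement from sp of argument number n (n counts from 0),
 * valid for the arguments passed on the stack, i.e. n >= 8. */
static int arg_stack_offset( int n )
{
    assert( n >= 8 );
    return ARG_OFF + n * 4;
}
```

For instance, `arg_stack_offset( 8 )` is 8 and `arg_stack_offset( 17 )` is 44.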



/*
 * macros to get GPR arguments beyond
 */

#define GET_ARG8( rD ) lwz rD, (ARG_OFF + 8*4)(sp);
#define GET_ARG9( rD ) lwz rD, (ARG_OFF + 9*4)(sp);
#define GET_ARG10( rD ) lwz rD, (ARG_OFF + 10*4)(sp);
#define GET_ARG11( rD ) lwz rD, (ARG_OFF + 11*4)(sp);
#define GET_ARG12( rD ) lwz rD, (ARG_OFF + 12*4)(sp);
#define GET_ARG13( rD ) lwz rD, (ARG_OFF + 13*4)(sp);
#define GET_ARG14( rD ) lwz rD, (ARG_OFF + 14*4)(sp);
#define GET_ARG15( rD ) lwz rD, (ARG_OFF + 15*4)(sp);
#define GET_ARG16( rD ) lwz rD, (ARG_OFF + 16*4)(sp);
#define GET_ARG17( rD ) lwz rD, (ARG_OFF + 17*4)(sp);



/*
 * macros to set GPR arguments beyond
 */

#define SET_ARG8( rD ) stw rD, (ARG_OFF + 8*4)(sp);
#define SET_ARG9( rD ) stw rD, (ARG_OFF + 9*4)(sp);
#define SET_ARG10( rD ) stw rD, (ARG_OFF + 10*4)(sp);
#define SET_ARG11( rD ) stw rD, (ARG_OFF + 11*4)(sp);
#define SET_ARG12( rD ) stw rD, (ARG_OFF + 12*4)(sp);
#define SET_ARG13( rD ) stw rD, (ARG_OFF + 13*4)(sp);
#define SET_ARG14( rD ) stw rD, (ARG_OFF + 14*4)(sp);
#define SET_ARG15( rD ) stw rD, (ARG_OFF + 15*4)(sp);
#define SET_ARG16( rD ) stw rD, (ARG_OFF + 16*4)(sp);
#define SET_ARG17( rD ) stw rD, (ARG_OFF + 17*4)(sp);

/*
 * macro to branch from one entry point to another
 */

#define BR_FUNC( func_name ) \
    b func_name;

/*
 * macros to call functions
 */
#define CALL_FUNC( func_name ) \
    bl func_name;

#define CALL_0( func_name ) \
    CALL_FUNC( func_name )

#define CALL_1( func_name, arg0 ) \
    CALL_FUNC( func_name )

#define CALL_2( func_name, arg0, arg1 ) \
    CALL_FUNC( func_name )

#define CALL_3( func_name, arg0, arg1, arg2 ) \
    CALL_FUNC( func_name )

#define CALL_4( func_name, arg0, arg1, arg2, arg3 ) \
    CALL_FUNC( func_name )

#define CALL_5( func_name, arg0, arg1, arg2, arg3, arg4 ) \
    CALL_FUNC( func_name )

#define CALL_6( func_name, arg0, arg1, arg2, arg3, arg4, arg5 ) \
    CALL_FUNC( func_name )

#define CALL_7( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6 ) \
    CALL_FUNC( func_name )

#define CALL_8( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7 ) \
    CALL_FUNC( func_name )

#define CALL_9( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                arg8 ) \
    CALL_FUNC( func_name )

#define CALL_10( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9 ) \
    CALL_FUNC( func_name )

#define CALL_11( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10 ) \
    CALL_FUNC( func_name )

#define CALL_12( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11 ) \
    CALL_FUNC( func_name )

#define CALL_13( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12 ) \
    CALL_FUNC( func_name )

#define CALL_14( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13 ) \
    CALL_FUNC( func_name )

#define CALL_15( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14 ) \
    CALL_FUNC( func_name )

#define CALL_16( func_name, arg0, arg1, arg2, arg3, arg4, arg5, arg6, arg7, \
                 arg8, arg9, arg10, arg11, arg12, arg13, arg14, arg15 ) \
    CALL_FUNC( func_name )

#if defined( BUILD_MAX )

#if defined( COMPILE_ESAL_JUMP_TABLE )

/*
 * G4 macros to create an ESAL jump table for 1, 2, 3 and 4 vector
 * algorithms. The table name is <root_name>_jump and is made a
 * local symbol. (not supported in C)
 */
#define DECLARE_VMX_V1( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _n ); \
    .long CONCAT( root_name, _c );

#define DECLARE_VMX_V2( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nn ); \
    .long CONCAT( root_name, _nc ); \
    .long CONCAT( root_name, _cn ); \
    .long CONCAT( root_name, _cc );

#define DECLARE_VMX_V3( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnn ); \
    .long CONCAT( root_name, _nnc ); \
    .long CONCAT( root_name, _ncn ); \
    .long CONCAT( root_name, _ncc ); \
    .long CONCAT( root_name, _cnn ); \
    .long CONCAT( root_name, _cnc ); \
    .long CONCAT( root_name, _ccn ); \
    .long CONCAT( root_name, _ccc );

#define DECLARE_VMX_V4( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnnn ); \
    .long CONCAT( root_name, _nnnc ); \
    .long CONCAT( root_name, _nncn ); \
    .long CONCAT( root_name, _nncc ); \
    .long CONCAT( root_name, _ncnn ); \
    .long CONCAT( root_name, _ncnc ); \
    .long CONCAT( root_name, _nccn ); \
    .long CONCAT( root_name, _nccc ); \
    .long CONCAT( root_name, _cnnn ); \
    .long CONCAT( root_name, _cnnc ); \
    .long CONCAT( root_name, _cncn ); \
    .long CONCAT( root_name, _cncc ); \
    .long CONCAT( root_name, _ccnn ); \
    .long CONCAT( root_name, _ccnc ); \
    .long CONCAT( root_name, _cccn ); \
    .long CONCAT( root_name, _cccc );

#define DECLARE_VMX_V5( root_name ) \
    .section .rodata; \
    .align 5; \
CONCAT( root_name, _jump ): \
    .long CONCAT( root_name, _nnnnn ); \
    .long CONCAT( root_name, _nnnnc ); \
    .long CONCAT( root_name, _nnncn ); \
    .long CONCAT( root_name, _nnncc ); \
    .long CONCAT( root_name, _nncnn ); \
    .long CONCAT( root_name, _nncnc ); \
    .long CONCAT( root_name, _nnccn ); \
    .long CONCAT( root_name, _nnccc ); \
    .long CONCAT( root_name, _ncnnn ); \
    .long CONCAT( root_name, _ncnnc ); \
    .long CONCAT( root_name, _ncncn ); \
    .long CONCAT( root_name, _ncncc ); \
    .long CONCAT( root_name, _nccnn ); \
    .long CONCAT( root_name, _nccnc ); \
    .long CONCAT( root_name, _ncccn ); \
    .long CONCAT( root_name, _ncccc ); \
    .long CONCAT( root_name, _cnnnn ); \
    .long CONCAT( root_name, _cnnnc ); \
    .long CONCAT( root_name, _cnncn ); \
    .long CONCAT( root_name, _cnncc ); \
    .long CONCAT( root_name, _cncnn ); \
    .long CONCAT( root_name, _cncnc ); \
    .long CONCAT( root_name, _cnccn ); \
    .long CONCAT( root_name, _cnccc ); \
    .long CONCAT( root_name, _ccnnn ); \
    .long CONCAT( root_name, _ccnnc ); \
    .long CONCAT( root_name, _ccncn ); \
    .long CONCAT( root_name, _ccncc ); \
    .long CONCAT( root_name, _cccnn ); \
    .long CONCAT( root_name, _cccnc ); \
    .long CONCAT( root_name, _ccccn ); \
    .long CONCAT( root_name, _ccccc );

#define DECLARE_VMX_Z1( root_name ) DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_Z2( root_name ) DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_Z3( root_name ) DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_Z4( root_name ) DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_Z5( root_name ) DECLARE_VMX_V5( root_name )
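
The jump tables declared above are not supported in C, but the idea can be sketched with a function-pointer table. In this illustration the root name `vadd`, the four variant functions and `dispatch_v2` are all hypothetical; the only thing taken from the listing is the entry order `_nn`, `_nc`, `_cn`, `_cc`, which matches the numeric values of the two-vector flags.

```c
#include <assert.h>

typedef int (*variant_fn)( void );

/* Hypothetical variants; each returns its own table index. */
static int vadd_nn( void ) { return 0; }
static int vadd_nc( void ) { return 1; }
static int vadd_cn( void ) { return 2; }
static int vadd_cc( void ) { return 3; }

/* C analogue of a two-vector <root_name>_jump table. */
static const variant_fn vadd_jump[4] = {
    vadd_nn, vadd_nc, vadd_cn, vadd_cc
};

/* Select a variant from the low two bits of the flag and call it. */
static int dispatch_v2( unsigned eflag )
{
    unsigned index = eflag & 3u;
    assert( index < 4u );
    return vadd_jump[index]();
}
```

Only the low two bits of the flag take part in the selection, so `dispatch_v2( 7 )` reaches the same entry as `dispatch_v2( 3 )`.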



/*
 * G4 macros to branch through the <root_name>_jump table based on
 * the value of the ESAL flag.
 * (uses r0 as scratch and destroys eflag)
 * (not supported in C)
 */

#define BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp ) \
    addis rtemp, 0, CONCAT( root_name, _jump@ha ); \
    addi rtemp, rtemp, CONCAT( root_name, _jump@l ); \
    lwzx rtemp, rtemp, r0; \
    mtctr rtemp; \
    bctr;

#define BR_VMX_V1( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 29, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V2( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 28, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V3( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 27, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V4( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 26, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_V5( root_name, eflag, rtemp ) \
    rlwinm r0, eflag, 2, 25, 29; \
    BR_ESAL_JUMP_TABLE_COMMON( root_name, rtemp )

#define BR_VMX_Z1( root_name, eflag, rtemp ) \
    BR_VMX_V1( root_name, eflag, rtemp )

#define BR_VMX_Z2( root_name, eflag, rtemp ) \
    BR_VMX_V2( root_name, eflag, rtemp )

#define BR_VMX_Z3( root_name, eflag, rtemp ) \
    BR_VMX_V3( root_name, eflag, rtemp )

#define BR_VMX_Z4( root_name, eflag, rtemp ) \
    BR_VMX_V4( root_name, eflag, rtemp )

#define BR_VMX_Z5( root_name, eflag, rtemp ) \
    BR_VMX_V5( root_name, eflag, rtemp )
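
The `rlwinm` step in the BR_VMX_Vn macros turns the flag into a byte offset into a table of 4-byte entries. A minimal C model of that instruction is sketched below; `rotl32`, `ppc_mask`, `rlwinm` and `jump_offset` are hypothetical helper names, and the mask start `30 - nvec` is an assumption read off the V1 to V4 operands (29, 28, 27, 26), extended to the five-vector case.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by sh bits. */
static uint32_t rotl32( uint32_t x, unsigned sh )
{
    sh &= 31u;
    return sh ? ( x << sh ) | ( x >> ( 32u - sh ) ) : x;
}

/* Ones in PowerPC bit positions mb..me, where bit 0 is the MSB
 * (only the non-wrapping case mb <= me is modelled). */
static uint32_t ppc_mask( unsigned mb, unsigned me )
{
    assert( mb <= me && me < 32u );
    uint32_t from_mb = 0xFFFFFFFFu >> mb;            /* bits mb..31 */
    uint32_t up_to_me = 0xFFFFFFFFu << ( 31u - me ); /* bits 0..me  */
    return from_mb & up_to_me;
}

/* Rotate left word immediate, then AND with mask. */
static uint32_t rlwinm( uint32_t rs, unsigned sh, unsigned mb, unsigned me )
{
    return rotl32( rs, sh ) & ppc_mask( mb, me );
}

/* Byte offset into an nvec-vector jump table: (eflag mod 2^nvec) * 4. */
static uint32_t jump_offset( uint32_t eflag, unsigned nvec )
{
    assert( nvec >= 1u && nvec <= 5u );
    return rlwinm( eflag, 2u, 30u - nvec, 29u );
}
```

Under these assumptions a two-vector flag value of 3 gives offset 12, the fourth table entry, and higher flag bits are discarded by the mask.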


#else /* COMPILE_ESAL_JUMP_TABLE not defined */

/*
 * G4 macros to create a dummy jump table.
 * (not supported in C)
 */

#define DECLARE_VMX_V1( root_name )
#define DECLARE_VMX_V2( root_name )
#define DECLARE_VMX_V3( root_name )
#define DECLARE_VMX_V4( root_name )
#define DECLARE_VMX_V5( root_name )

#define DECLARE_VMX_Z1( root_name )
#define DECLARE_VMX_Z2( root_name )
#define DECLARE_VMX_Z3( root_name )
#define DECLARE_VMX_Z4( root_name )
#define DECLARE_VMX_Z5( root_name )

/*
 * G4 macros to simply branch to root_name (no jump table)
 * (not supported in C)
 */

#define BR_VMX_V1( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V2( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V3( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V4( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_V5( root_name, eflag, rtemp ) \
    b root_name;

#define BR_VMX_Z1( root_name, eflag, rtemp ) \
    BR_VMX_V1( root_name, eflag, rtemp )

#define BR_VMX_Z2( root_name, eflag, rtemp ) \
    BR_VMX_V2( root_name, eflag, rtemp )

#define BR_VMX_Z3( root_name, eflag, rtemp ) \
    BR_VMX_V3( root_name, eflag, rtemp )

#define BR_VMX_Z4( root_name, eflag, rtemp ) \
    BR_VMX_V4( root_name, eflag, rtemp )

#define BR_VMX_Z5( root_name, eflag, rtemp ) \
    BR_VMX_V5( root_name, eflag, rtemp )

#endif /* end COMPILE_ESAL_JUMP_TABLE */

/*
 * G4 macros to decide whether to enter a VMX loop
 * VMX loop is entered if at least minimum count,
 * all vectors have the same relative alignment
 * (i.e., same lower 4 bits) and all strides are unit.
 * Note, a unit_s_imm argument is provided because some
 * packed interleaved complex functions (stride 2) such
 * as cvaddx() can be implemented with a VMX loop.
 * Only one macro should be invoked per source file.
 * (uses r0 as scratch)
 * (not supported in C)
 */
#define BR_IF_VMX_V1( root_name, min_n_imm, unit_s_imm, p1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    BR_VMX_V1( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V1_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    andi. r0, p1, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V1( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:
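
The entry test performed by BR_IF_VMX_V2 can be restated in C as follows. This is a sketch of the decision only, not of the branch itself; `can_enter_vmx_v2` and its parameter types are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when the two-vector VMX loop would be entered:
 *   - the element count is at least the minimum,
 *   - both strides equal the unit stride,
 *   - both pointers share the same low 4 address bits. */
static int can_enter_vmx_v2( uintptr_t p1, long s1,
                             uintptr_t p2, long s2,
                             unsigned long n,
                             unsigned long min_n, long unit_s )
{
    if ( n < min_n )                return 0; /* cmplwi / blt */
    if ( s1 != unit_s )             return 0; /* cmpwi / bne  */
    if ( s2 != unit_s )             return 0; /* cmpwi / bne  */
    if ( ( ( p1 ^ p2 ) & 0xfu ) != 0 ) return 0; /* xor / andi. / bne */
    return 1;
}
```

The `_ALIGNED` variants in the listing replace the XOR with an OR, which additionally requires every pointer to be 16-byte aligned rather than merely aligned relative to the others.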

#define BR_IF_VMX_V2_LS( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, ps, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    srwi r0, p1, 1; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, r0, ps; \
    bne v_skip_vmx; \
    andi. r0, r0, 0x6; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2_LC( root_name, min_n_imm, unit_s_imm, \
                         p1, s1, pc, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    andi. r0, pc, 1; \
    bne v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    srwi r0, p1, 2; \
    bne v_skip_vmx; \
    xor r0, r0, pc; \
    andi. r0, r0, 0x3; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V2_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V2( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V3( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V3_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:
#define BR_IF_VMX_V4( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V4( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_V4_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V4( root_name, eflag, s1 ) \
v_skip_vmx:



#define BR_IF_VMX_V5( root_name, min_n_imm, unit_s_imm, \
                      p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    xor r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, p1, p5; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V5( root_name, eflag, s1 ) \
v_skip_vmx:



#define BR_IF_VMX_V5_ALIGNED( root_name, min_n_imm, unit_s_imm, \
                              p1, s1, p2, s2, p3, s3, p4, s4, p5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne v_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    or r0, p1, p2; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p4; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    or r0, p1, p5; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V5( root_name, eflag, s1 ) \
v_skip_vmx:



#define BR_IF_VMX_Z1( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z1( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z2( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z2( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z3( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z3( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z4( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z4( root_name, eflag, s1 ) \
z_skip_vmx:

#define BR_IF_VMX_Z5( root_name, min_n_imm, unit_s_imm, \
                      pr1, pi1, s1, pr2, pi2, s2, pr3, pi3, s3, \
                      pr4, pi4, s4, pr5, pi5, s5, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s2, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s3, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s4, unit_s_imm; \
    bne z_skip_vmx; \
    cmpwi s5, unit_s_imm; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi2; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi4; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr5; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi5; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z5( root_name, eflag, s1 ) \
z_skip_vmx:



#define BR_IF_VMX_CONV( root_name, min_n_imm, \
                        p1, s1, s2, p3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt v_skip_vmx; \
    cmpwi s1, 1; \
    bne v_skip_vmx; \
    cmpwi s2, 1; \
    beq PC_OFFSET( 12 ); \
    cmpwi s2, -1; \
    bne v_skip_vmx; \
    cmpwi s3, 1; \
    xor r0, p1, p3; \
    bne v_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne v_skip_vmx; \
    BR_VMX_V3( root_name, eflag, s1 ) \
v_skip_vmx:

#define BR_IF_VMX_ZCONV( root_name, min_n_imm, \
                         pr1, pi1, s1, s2, pr3, pi3, s3, n, eflag ) \
    cmplwi n, min_n_imm; \
    blt z_skip_vmx; \
    cmpwi s1, 1; \
    bne z_skip_vmx; \
    cmpwi s2, 1; \
    beq PC_OFFSET( 12 ); \
    cmpwi s2, -1; \
    bne z_skip_vmx; \
    cmpwi s3, 1; \
    xor r0, pr1, pi1; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pr3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    xor r0, pr1, pi3; \
    bne z_skip_vmx; \
    andi. r0, r0, 0xf; \
    bne z_skip_vmx; \
    BR_VMX_Z3( root_name, eflag, s1 ) \
z_skip_vmx:

/*
 * G4 macro to get VMX unaligned word (FP) count
 * assumes that the last 2 bits of ptr are 0
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 30, 30, 31;

/*
 * G4 macro to get VMX unaligned short count
 * assumes that the last bit of ptr is 0
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT_S( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 31, 29, 31;

/*
 * G4 macro to get VMX unaligned char count
 * sets condition code CR0
 */
#define GET_VMX_UNALIGNED_COUNT_C( count, ptr ) \
    neg count, ptr; \
    rlwinm. count, count, 0, 28, 31;
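
The three GET_VMX_UNALIGNED_COUNT macros compute how many elements lie between ptr and the next 16-byte boundary by negating the address and extracting a small bit field with rlwinm. A minimal C sketch of that arithmetic follows, assuming 32-bit addresses; the helper names are hypothetical and the rlwinm model covers only the non-wrapping mask case (mb <= me) used here.

```c
#include <stdint.h>

/* Model of PowerPC rlwinm: rotate rs left by sh, then keep IBM-numbered
 * bits mb..me (bit 0 is the most significant). Assumes mb <= me. */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    uint32_t rot = (sh == 0) ? rs : ((rs << sh) | (rs >> (32u - sh)));
    uint32_t mask = (0xffffffffu >> mb) & (0xffffffffu << (31u - me));
    return rot & mask;
}

/* neg count, ptr; rlwinm. count, count, 30, 30, 31  -> 4-byte words */
static uint32_t vmx_unaligned_words(uint32_t ptr)
{
    return rlwinm(0u - ptr, 30, 30, 31);
}

/* neg count, ptr; rlwinm. count, count, 31, 29, 31  -> 2-byte shorts */
static uint32_t vmx_unaligned_shorts(uint32_t ptr)
{
    return rlwinm(0u - ptr, 31, 29, 31);
}

/* neg count, ptr; rlwinm. count, count, 0, 28, 31   -> bytes */
static uint32_t vmx_unaligned_chars(uint32_t ptr)
{
    return rlwinm(0u - ptr, 0, 28, 31);
}
```

For example, an address ending in 0x4 is 12 bytes short of the next quadword, which is 3 words; an already aligned address yields 0 in all three forms.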

/*
 * G4 macro to load and splat an FP scalar independent of alignment
 */
#if defined( LITTLE_ENDIAN )

#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    lvxl vt, 0, scalarp; \
    lvsr vtmp, 0, scalarp; \
    vperm vt, vt, vt, vtmp; \
    vspltw vt, vt, 3;

#else

#define SCALAR_SPLAT( vt, vtmp, scalarp ) \
    lvxl vt, 0, scalarp; \
    lvsl vtmp, 0, scalarp; \
    vperm vt, vt, vt, vtmp; \
    vspltw vt, vt, 0;

#endif

/*
 * G4 macro to construct an FP absolute value mask that can be used with
 * vand to take the absolute value of 4 FP numbers in a vector register
 * vt = 0x7fffffff7fffffff7fffffff7fffffff
 */
#define MAKE_VABS_MASK( vt ) \
    vspltisw vt, -1; \
    vslw vt, vt, vt; \
    vnor vt, vt, vt;

/*
 * G4 macro to construct an FP sign mask that can be used with:
 *   vandc to take the absolute value of
 *   vor to take the negative absolute value of
 *   vxor to negate
 * 4 FP numbers in a vector register
 * vt = 0x80000000800000008000000080000000
 */
#define MAKE_VSIGN_MASK( vt ) \
    vspltisw vt, -1; \
    vslw vt, vt, vt;
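
The mask macros rely on vslw taking its shift count from the low 5 bits of each word of the shift operand: a word of all ones shifted left by 31 leaves only the sign bit, and vnor of that value with itself gives the complement. The C sketch below reproduces the per-word arithmetic and applies the absolute-value mask to one float; the function names are illustrative, not part of the original file.

```c
#include <stdint.h>
#include <string.h>

/* One 32-bit lane of MAKE_VSIGN_MASK. */
static uint32_t vsign_mask_word(void)
{
    uint32_t w = 0xffffffffu;   /* vspltisw vt, -1 */
    return w << (w & 31u);      /* vslw vt, vt, vt: shift count is 31 */
}

/* One 32-bit lane of MAKE_VABS_MASK. */
static uint32_t vabs_mask_word(void)
{
    return ~vsign_mask_word();  /* vnor vt, vt, vt */
}

/* Scalar stand-in for "vand v, v, mask" on a single FP lane. */
static float fabs_via_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= vabs_mask_word();
    memcpy(&x, &bits, sizeof x);
    return x;
}
```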

/*
 * G4 macros to construct a coded touch stream control register
 * "I" indicates argument is passed as an immediate value
 * "R" indicates argument is passed in an integer register
 * bytes_per_block = # of bytes in each block
 *                   (0 = 512, 16, 32, 480, 512)
 * block_count = # of blocks (0 = 256, 1, 2, 3, ... 256)
 * byte_stride = signed byte stride between start of adjacent blocks
 *               (-32768 <= byte_stride < 0; 0 = 32768; 0 < byte_stride < 32768)
 */
#define MAKE_STREAM_CODE_III( rB, bytes_per_block, block_count, byte_stride ) \
    lis rB, ((((bytes_per_block) >> 4) & 31) << 8) | ((block_count) & 255); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE( rB, bytes_per_block, block_count, byte_stride ) \
    MAKE_STREAM_CODE_III( rB, bytes_per_block, block_count, byte_stride )

#define MAKE_STREAM_CODE_IIR( rB, bytes_per_block, block_count, byte_stride ) \
    lis rB, ((((bytes_per_block) >> 4) & 31) << 8) | ((block_count) & 255); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_IRI( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, block_count, 16, 8, 15; \
    oris rB, rB, ((((bytes_per_block) >> 4) & 31) << 8); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_IRR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, block_count, 16, 8, 15; \
    oris rB, rB, ((((bytes_per_block) >> 4) & 31) << 8); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_RII( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    oris rB, rB, ((block_count) & 255); \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_RIR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    oris rB, rB, ((block_count) & 255); \
    rlwimi rB, byte_stride, 0, 16, 31;

#define MAKE_STREAM_CODE_RRI( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    rlwimi rB, block_count, 16, 8, 15; \
    ori rB, rB, ((byte_stride) & 0x0000ffff);

#define MAKE_STREAM_CODE_RRR( rB, bytes_per_block, block_count, byte_stride ) \
    rlwinm rB, bytes_per_block, 20, 3, 7; \
    rlwimi rB, block_count, 16, 8, 15; \
    rlwimi rB, byte_stride, 0, 16, 31;



/* end BUILD_MAX */ 
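
All eight MAKE_STREAM_CODE variants assemble the same 32-bit control word; they differ only in whether each field arrives as an immediate or in a register. As a rough illustration, the packing done by the all-immediate form can be written in C as below. The function name is an assumption made for this sketch; the field positions follow the lis/ori pair in MAKE_STREAM_CODE_III.

```c
#include <stdint.h>

/* Pack a data-stream control word the way MAKE_STREAM_CODE_III does:
 *   bits 28..24: (bytes_per_block / 16) mod 32   (0 encodes 512 bytes)
 *   bits 23..16: block_count mod 256             (0 encodes 256 blocks)
 *   bits 15..0 : byte_stride as a 16-bit two's complement value */
static uint32_t make_stream_code(uint32_t bytes_per_block,
                                 uint32_t block_count,
                                 int32_t byte_stride)
{
    uint32_t size_field   = (bytes_per_block >> 4) & 31u;
    uint32_t count_field  = block_count & 255u;
    uint32_t stride_field = (uint32_t)byte_stride & 0xffffu;
    return (size_field << 24) | (count_field << 16) | stride_field;
}
```

With 32-byte blocks, 4 blocks and a stride of 32 this gives 0x02040020; the maximum sizes (512 bytes, 256 blocks) both wrap to zero fields, matching the "0 = 512" and "0 = 256" notes in the macro comment.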



#define CACHE_TB_THRESHOLD 1 /* 2 TB ticks = 12 CPU 100 MHz clks */
#define INSTRUCTION_CACHE_COUNT 3 /* min. to fully cache instructions */
#define POSTING_BUFFER_COUNT 10 /* min. to fill posting buffer */

/*
 * macros to set DCBx conditions explicitly
 */
#define DCBT_TRUE( cond_bit, scratch ) \
    li scratch, 0; \
    cmplwi (cond_bit), scratch, 1;

#define DCBZ_TRUE( cond_bit, scratch ) \
    DCBT_TRUE( cond_bit, scratch )

#define DCBT_FALSE( cond_bit, scratch ) \
    li scratch, 2; \
    cmplwi (cond_bit), scratch, 1;

#define DCBZ_FALSE( cond_bit, scratch ) \
    DCBT_FALSE( cond_bit, scratch )

/*
 * This macro will cause a file not to assemble.
 */
#define DO_NOT_ASSEMBLE add scratch1, scratch2, 256;

/*
 * Obsolete macro will cause assembler error
 */
#define TEST_IF_CACHABLE( cond_bit, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

/*
 * Obsolete macro will cause assembler error
 */
#define TEST_IF_CACHABLE_ALIGN( cond_bit, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

/*
 * macros to test if a DCBT or DCBZ instruction should be performed on
 * a particular buffer based on a bit test (cache bit) on a specified
 * ESAL flag.
 */
#define TEST_IF_DCBT( cond_bit, cache_bit, eflag, buffer, scratch1, scratch2 ) \
    DO_NOT_ASSEMBLE

#define SET_DCBT_COND( cond_bit, cache_bit, eflag, scratch1 ) \
    andi. scratch1, eflag, (cache_bit); \
    cmplwi (cond_bit), scratch1, 0;

/*
 * Set 2 dcbt conditions and ensure only one is true
 * Ins. 1-3  Set both conditions to "No DCBT"
 * Ins. 4    See if vec1 has a C
 * Ins. 5    Set DCBT cond1
 * Ins. 6    Branch if "DCBT TRUE" (eflag & bit1 = 0)
 * Ins. 7-8  Set DCBT cond2
 */
#define SET_2_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0; \
    bc 12, ((cond1_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0;
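
The instruction-by-instruction notes for SET_2_DCBT_COND describe a small priority scheme: both conditions start as "no touch", the first is enabled when its cache bit is clear in eflag, and the second is only evaluated when the first was not enabled. A hedged C sketch of that control flow is given below; the struct and function names are invented for illustration, and "true" here stands for the "<=" state that DCBT_IF treats as "perform the dcbt".

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative result type: at most one member ends up true. */
typedef struct {
    bool cond1;
    bool cond2;
} dcbt_pair;

static dcbt_pair set_2_dcbt_cond(uint32_t eflag,
                                 uint32_t cache_bit1,
                                 uint32_t cache_bit2)
{
    dcbt_pair r = { false, false };              /* Ins. 1-3: both "No DCBT" */
    r.cond1 = (eflag & cache_bit1) == 0;         /* Ins. 4-5 */
    if (!r.cond1)                                /* Ins. 6: skip if cond1 true */
        r.cond2 = (eflag & cache_bit2) == 0;     /* Ins. 7-8 */
    return r;
}
```

The 3- and 4-condition macros apply the same pattern with longer branch offsets, testing the highest-numbered condition first.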

/*
 * Set 3 dcbt conditions and ensure only one is true
 * Logic is similar to the SET_2_DCBT_COND() macro
 */
#define SET_3_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         cond3_bit, cache_bit3, eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    cmplwi (cond3_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit3); \
    cmplwi (cond3_bit), scratch, 0; \
    bc 12, ((cond3_bit)<<2)+2, PC_OFFSET( 24 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0; \
    bc 12, ((cond2_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0;

/*
 * Set 4 dcbt conditions and ensure only one is true
 * Logic is similar to the SET_2_DCBT_COND() macro
 */
#define SET_4_DCBT_COND( cond1_bit, cache_bit1, cond2_bit, cache_bit2, \
                         cond3_bit, cache_bit3, cond4_bit, cache_bit4, \
                         eflag, scratch ) \
    li scratch, 2; \
    cmplwi (cond1_bit), scratch, 1; \
    cmplwi (cond2_bit), scratch, 1; \
    cmplwi (cond3_bit), scratch, 1; \
    cmplwi (cond4_bit), scratch, 1; \
    andi. scratch, eflag, (cache_bit4); \
    cmplwi (cond4_bit), scratch, 0; \
    bc 12, ((cond4_bit)<<2)+2, PC_OFFSET( 36 ); \
    andi. scratch, eflag, (cache_bit3); \
    cmplwi (cond3_bit), scratch, 0; \
    bc 12, ((cond3_bit)<<2)+2, PC_OFFSET( 24 ); \
    andi. scratch, eflag, (cache_bit2); \
    cmplwi (cond2_bit), scratch, 0; \
    bc 12, ((cond2_bit)<<2)+2, PC_OFFSET( 12 ); \
    andi. scratch, eflag, (cache_bit1); \
    cmplwi (cond1_bit), scratch, 0;

#if !defined COMPILE_NO_DCBZ

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    andi. tmp3, eflag, (cache_bit); \
    cmplwi (cond_bit), tmp3, 0; \
    bne PC_OFFSET( 104 ); \
    cmplwi 1, stride, unit_stride; \
    bne 1, PC_OFFSET( 92 ); \
    cmplwi 1, count, (CACHE_LINE_LSIZE<<unit_stride); \
    blt 1, PC_OFFSET( 84 ); \
    addi tmp2, buffer, CACHE_LINE_SIZE; \
    li tmp3, CACHE_LINE_ADDR_MASK; \
    and tmp2, tmp2, tmp3; \
    mfcr tmp3; \
    stw tmp3, CR_SAVE_OFF(sp); \
    mflr tmp3; \
    stw tmp3, LR_SAVE_OFF(sp); \
    CREATE_STACK_FRAME( 0 ) \
    mr tmp1, r3; \
    mr r3, tmp2; \
    bl ppc_buf_is_dcbz_safe; \
    DESTROY_STACK_FRAME \
    lwz tmp3, LR_SAVE_OFF(sp); \
    mtlr tmp3; \
    lwz tmp3, CR_SAVE_OFF(sp); \
    mtcr tmp3; \
    li tmp2, 0; \
    cmplw 1, tmp2, r3; \
    mr r3, tmp1; \
    bne 1, PC_OFFSET( 8 ); \
    cmpwi (cond_bit), count, -1;

#define SET_DCBZ_ALIGN_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                             unit_stride, count, tmp1, tmp2, tmp3 ) \
    andi. tmp3, eflag, (cache_bit); \
    cmplwi (cond_bit), tmp3, 0; \
    bne PC_OFFSET( 100 ); \
    cmplwi 1, stride, unit_stride; \
    bne 1, PC_OFFSET( 88 ); \
    cmplwi 1, count, (CACHE_LINE_LSIZE<<unit_stride); \
    blt 1, PC_OFFSET( 80 ); \
    andi. tmp3, buffer, CACHE_LINE_MASK; \
    bne PC_OFFSET( 72 ); \
    mfcr tmp3; \
    stw tmp3, CR_SAVE_OFF(sp); \
    mflr tmp3; \
    stw tmp3, LR_SAVE_OFF(sp); \
    CREATE_STACK_FRAME( 0 ) \
    mr tmp1, r3; \
    mr r3, buffer; \
    bl ppc_buf_is_dcbz_safe; \
    DESTROY_STACK_FRAME \
    lwz tmp3, LR_SAVE_OFF(sp); \
    mtlr tmp3; \
    lwz tmp3, CR_SAVE_OFF(sp); \
    mtcr tmp3; \
    li tmp2, 0; \
    cmplw 1, tmp2, r3; \
    mr r3, tmp1; \
    bne 1, PC_OFFSET( 8 ); \
    cmpwi (cond_bit), count, -1;

#else /* COMPILE_NO_DCBZ is defined */

#define SET_DCBZ_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                       unit_stride, count, tmp1, tmp2, tmp3 ) \
    DCBZ_FALSE( cond_bit, tmp1 )

#define SET_DCBZ_ALIGN_COND( cond_bit, cache_bit, eflag, buffer, stride, \
                             unit_stride, count, tmp1, tmp2, tmp3 ) \
    DCBZ_FALSE( cond_bit, tmp1 )

#endif /* COMPILE_NO_DCBZ */

/*
 * macro to perform [or skip] a dcbt instruction based on the result
 * of a prior call to TEST_IF_DCBT (specifying the same condition bit).
 * dcbt is performed if the cond "<=" is true; otherwise dcbt is skipped.
 */
#define DCBT_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    dcbt rA, rB;

/*
 * macro to perform [or skip] a dcbz instruction based on the result
 * of a prior call to TEST_IF_DCBZ (specifying the same condition bit).
 * dcbz is performed if the cond "<=" is true; otherwise dcbz is skipped.
 */
#if !defined COMPILE_NO_DCBZ

#define DCBZ_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    dcbz rA, rB;

#else

#define DCBZ_IF( cond_bit, rA, rB ) \
    bc 12, ((cond_bit)<<2)+1, PC_OFFSET( 8 ); \
    nop;

#endif

/*
 * macro to branch to a label if the buffer specified in a prior
 * call to TEST_IF_CACHABLE (also specifying the same condition bit)
 * was cachable (i.e. TB read time was <= CACHE_TB_THRESHOLD).
 */
#define BR_IF_COND_TRUE( cond_bit, label ) \
    bc 4, ((cond_bit)<<2)+1, label; /* <= */

/*
 * macro to branch to a label if the buffer specified in a prior
 * call to TEST_IF_CACHABLE (also specifying the same condition bit)
 * was NOT cachable (i.e. TB read time was > CACHE_TB_THRESHOLD).
 */
#define BR_IF_COND_FALSE( cond_bit, label ) \
    bc 12, ((cond_bit)<<2)+1, label; /* > */



/*
 * ASIC macros
 */
#if defined( COMPILE_PREFETCH )

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 ) \
    li scratch1, mode; \
    addis scratch2, 0, PREFETCH_CONTROL_H; \
    stw scratch1, PREFETCH_CONTROL_L( scratch2 );

#define LOAD_MISCON_B( mode, scratch1, scratch2 ) \
    li scratch1, mode; \
    addis scratch2, 0, MISCON_B_H; \
    stw scratch1, MISCON_B_L( scratch2 );

#define RESET_PREFETCH_CONTROL( scratch1, scratch2 ) \
    addis scratch2, 0, ASIC_H; \
    lwz scratch1, MISCON_B_L( scratch2 ); \
    andi. scratch1, scratch1, PREFETCH_MASK; \
    ori scratch1, scratch1, USE_PREFETCH_CONTROL; \
    stw scratch1, PREFETCH_CONTROL_L( scratch2 );

#else

#define LOAD_PREFETCH_CONTROL( mode, scratch1, scratch2 )
#define LOAD_MISCON_B( mode, scratch1, scratch2 )
#define RESET_PREFETCH_CONTROL( scratch1, scratch2 )

#endif

/*
 * instruction macros
 */

#define ADD( rD, rA, rB )             add rD, rA, rB;
#define ADD_C( rD, rA, rB )           add. rD, rA, rB;
#define ADDI( rD, rA, SIMM )          addi rD, rA, (SIMM);
#define ADDIC_C( rD, rA, SIMM )       addic. rD, rA, (SIMM);
#define ADDIS( rD, rA, SIMM )         addis rD, rA, (SIMM);
#define AND( rA, rS, rB )             and rA, rS, rB;
#define AND_C( rA, rS, rB )           and. rA, rS, rB;
#define ANDC( rA, rS, rB )            andc rA, rS, rB;
#define ANDC_C( rA, rS, rB )          andc. rA, rS, rB;
#define ANDI_C( rA, rS, UIMM )        andi. rA, rS, (UIMM);
#define ANDIS_C( rA, rS, UIMM )       andis. rA, rS, (UIMM);
#define BA( label )                   ba label;
#define BCTR                          bctr;
#define BCTRL                         bctrl;
#define BEQ( label )                  beq label;
#define BEQ_PLUS( label )             beq+ label;
#define BEQ_MINUS( label )            beq- label;
#define BEQ_CR( bit, label )          beq (bit), label;
#define BEQ_CR_PLUS( bit, label )     beq+ (bit), label;
#define BEQ_CR_MINUS( bit, label )    beq- (bit), label;
#define BEQLR                         beqlr;
#define BEQLR_PLUS                    beqlr+;
#define BEQLR_MINUS                   beqlr-;
#define BEQLR_CR( bit )               beqlr (bit);
#define BEQLR_CR_PLUS( bit )          beqlr+ (bit);
#define BEQLR_CR_MINUS( bit )         beqlr- (bit);
#define BGE( label )                  bge label;
#define BGE_PLUS( label )             bge+ label;
#define BGE_MINUS( label )            bge- label;
#define BGE_CR( bit, label )          bge (bit), label;
#define BGE_CR_PLUS( bit, label )     bge+ (bit), label;
#define BGE_CR_MINUS( bit, label )    bge- (bit), label;
#define BGELR                         bgelr;
#define BGELR_PLUS                    bgelr+;
#define BGELR_MINUS                   bgelr-;
#define BGELR_CR( bit )               bgelr (bit);
#define BGELR_CR_PLUS( bit )          bgelr+ (bit);
#define BGELR_CR_MINUS( bit )         bgelr- (bit);
#define BGT( label )                  bgt label;
#define BGT_PLUS( label )             bgt+ label;
#define BGT_MINUS( label )            bgt- label;
#define BGT_CR( bit, label )          bgt (bit), label;
#define BGT_CR_PLUS( bit, label )     bgt+ (bit), label;
#define BGT_CR_MINUS( bit, label )    bgt- (bit), label;
#define BGTLR                         bgtlr;
#define BGTLR_PLUS                    bgtlr+;
#define BGTLR_MINUS                   bgtlr-;
#define BGTLR_CR( bit )               bgtlr (bit);
#define BGTLR_CR_PLUS( bit )          bgtlr+ (bit);
#define BGTLR_CR_MINUS( bit )         bgtlr- (bit);
#define BL( label )                   bl label;
#define BLE( label )                  ble label;
#define BLE_PLUS( label )             ble+ label;
#define BLE_MINUS( label )            ble- label;
#define BLE_CR( bit, label )          ble (bit), label;
#define BLE_CR_PLUS( bit, label )     ble+ (bit), label;
#define BLE_CR_MINUS( bit, label )    ble- (bit), label;
#define BLELR                         blelr;
#define BLELR_PLUS                    blelr+;
#define BLELR_MINUS                   blelr-;
#define BLELR_CR( bit )               blelr (bit);
#define BLELR_CR_PLUS( bit )          blelr+ (bit);
#define BLELR_CR_MINUS( bit )         blelr- (bit);
#define BLR                           blr;
#define BLRL                          blrl;
#define BLT( label )                  blt label;
#define BLT_PLUS( label )             blt+ label;
#define BLT_MINUS( label )            blt- label;
#define BLT_CR( bit, label )          blt (bit), label;
#define BLT_CR_PLUS( bit, label )     blt+ (bit), label;
#define BLT_CR_MINUS( bit, label )    blt- (bit), label;
#define BLTLR                         bltlr;
#define BLTLR_PLUS                    bltlr+;
#define BLTLR_MINUS                   bltlr-;
#define BLTLR_CR( bit )               bltlr (bit);
#define BLTLR_CR_PLUS( bit )          bltlr+ (bit);
#define BLTLR_CR_MINUS( bit )         bltlr- (bit);
#define BNE( label )                  bne label;
#define BNE_PLUS( label )             bne+ label;
#define BNE_MINUS( label )            bne- label;
#define BNE_CR( bit, label )          bne (bit), label;
#define BNE_CR_PLUS( bit, label )     bne+ (bit), label;
#define BNE_CR_MINUS( bit, label )    bne- (bit), label;
#define BNELR                         bnelr;
#define BNELR_PLUS                    bnelr+;
#define BNELR_MINUS                   bnelr-;
#define BNELR_CR( bit )               bnelr (bit);
#define BNELR_CR_PLUS( bit )          bnelr+ (bit);
#define BNELR_CR_MINUS( bit )         bnelr- (bit);
#define BR( label )                   b label;
#define CLRLWI( rA, rS, nbits )       clrlwi rA, rS, (nbits);
#define CLRLWI_C( rA, rS, nbits )     clrlwi. rA, rS, (nbits);
#define CLRRWI( rA, rS, nbits )       clrrwi rA, rS, (nbits);
#define CLRRWI_C( rA, rS, nbits )     clrrwi. rA, rS, (nbits);
#define CMPLW( rA, rB )               cmplw rA, rB;
#define CMPLW_CR( bit, rA, rB )       cmplw bit, rA, rB;
#define CMPLWI( rA, UIMM )            cmplwi rA, (UIMM);
#define CMPLWI_CR( bit, rA, UIMM )    cmplwi bit, rA, (UIMM);
#define CMPW( rA, rB )                cmpw rA, rB;
#define CMPW_CR( bit, rA, rB )        cmpw bit, rA, rB;
#define CMPWI( rA, SIMM )             cmpwi rA, (SIMM);
#define CMPWI_CR( bit, rA, SIMM )     cmpwi bit, rA, (SIMM);
#define DCBF( rA, rB )                dcbf rA, rB;
#define DCBI( rA, rB )                dcbi rA, rB;
#define DCBST( rA, rB )               dcbst rA, rB;
#define DCBT( rA, rB )                dcbt rA, rB;
#define DCBTST( rA, rB )              dcbtst rA, rB;
#if !defined COMPILE_NO_DCBZ
#define DCBZ( rA, rB )                dcbz rA, rB;
#else
#define DCBZ( rA, rB )
#endif
#define DECR( rD )                    addi rD, rD, -1;
#define DECR_C( rD )                  addic. rD, rD, -1;
#define DIVW( rD, rA, rB )            divw rD, rA, rB;
#define DIVW_C( rD, rA, rB )          divw. rD, rA, rB;
#define DIVWU( rD, rA, rB )           divwu rD, rA, rB;
#define DIVWU_C( rD, rA, rB )         divwu. rD, rA, rB;
#define EQV( rA, rS, rB )             eqv rA, rS, rB;
#define EQV_C( rA, rS, rB )           eqv. rA, rS, rB;
#define EXTLWI( rA, rS, n, b )        rlwinm rA, rS, (b), 0, (n)-1;
#define EXTLWI_C( rA, rS, n, b )      rlwinm. rA, rS, (b), 0, (n)-1;
#define EXTRWI( rA, rS, n, b )        rlwinm rA, rS,
#define EXTRWI_C( rA, rS, n, b )      rlwinm. rA, rS
#define FABS( frD, frB )              fabs frD, frB;
#define FADD( frD, frA, frB )         fadd frD, frA, frB;
#define FADDS( frD, frA, frB )        fadds frD, frA, frB;
#define FCMPO( bit, frA, frB )        fcmpo bit, frA, frB;
#define FCMPU( bit, frA, frB )        fcmpu bit, frA, frB;
#define FCTIW( frD, frB )             fctiw frD, frB;
#define FCTIWZ( frD, frB )            fctiwz frD, frB;
#define FDIV( frD, frA, frB )         fdiv frD, frA, frB;
#define FDIVS( frD, frA, frB )        fdivs frD, frA, frB;
#define FMADD( frD, frA, frC, frB )   fmadd frD, frA, frC, frB;
#define FMADDS( frD, frA, frC, frB )  fmadds frD, frA, frC, frB;
#define FMOV( frD, frB )              FMR( frD, frB )
#define FMR( frD, frB )               fmr frD, frB;
#define FMUL( frD, frA, frB )         fmul frD, frA, frB;
#define FMULS( frD, frA, frB )        fmuls frD, frA, frB;
#define FMSUB( frD, frA, frC, frB )   fmsub frD, frA, frC, frB;
#define FMSUBS( frD, frA, frC, frB )  fmsubs frD, frA, frC, frB;
#define FNABS( frD, frB )             fnabs frD, frB;
#define FNEG( frD, frB )              fneg frD, frB;
#define FNMADD( frD, frA, frC, frB )  fnmadd frD, frA, frC, frB;
#define FNMADDS( frD, frA, frC, frB ) fnmadds frD, frA, frC, frB;
#define FNMSUB( frD, frA, frC, frB )  fnmsub frD, frA, frC, frB;
#define FNMSUBS( frD, frA, frC, frB ) fnmsubs frD, frA, frC, frB;
#define FRES( frD, frB )              fres frD, frB;
#define FRSP( frD, frB )              frsp frD, frB;
#define FRSQRTE( frD, frB )           frsqrte frD, frB;
#define FSEL( frD, frA, frC, frB )    fsel frD, frA, frC, frB;
#define FSUB( frD, frA, frB )         fsub frD, frA, frB;
#define FSUBS( frD, frA, frB )        fsubs frD, frA, frB;
#define GOTO( label )                 BR( label )
#define INCR( rD )                    addi rD, rD, 1;
#define INCR_C( rD )                  addic. rD, rD, 1;
#define INSLWI( rA, rS, n, b )        rlwimi rA, rS, 32-(b), (b), (b)+(n)-1;
#define INSLWI_C( rA, rS, n, b )      rlwimi. rA, rS, 32-(b), (b), (b)+(n)-1;
#define INSRWI( rA, rS, n, b )        rlwimi rA, rS, 32-((b)+(n)), (b), (b)+(n)-1;
#define INSRWI_C( rA, rS, n, b )      rlwimi. rA, rS, 32-((b)+(n)), (b), (b)+(n)-1;
#define LA( rD, symbol, SIMM ) \
    addis rD, 0, (symbol+(SIMM))@ha; \
    addi rD, rD, (symbol+(SIMM))@l;
#define LABEL( label )                label:
#define LBZ( rD, rA, d )              lbz rD, (d)(rA);
#define LBZA( rD, symbol ) \
    addis rD, 0, (symbol)@ha; \
    lbz rD, (symbol)@l(rD);
#define LBZU( rD, rA, d )             lbzu rD, (d)(rA);
#define LBZUX( rD, rA, rB )           lbzux rD, rA, rB;
#define LBZX( rD, rA, rB )            lbzx rD, rA, rB;
#define LFD( frD, rA, d )             lfd frD, (d)(rA);
#define LFDU( frD, rA, d )            lfdu frD, (d)(rA);
#define LFDUX( frD, rA, rB )          lfdux frD, rA, rB;
#define LFDX( frD, rA, rB )           lfdx frD, rA, rB;
#define LFS( frD, rA, d )             lfs frD, (d)(rA);
#define LFSA( frD, symbol, rT ) \
    addis rT, 0, (symbol)@ha; \
    lfs frD, (symbol)@l(rT);
#define LFSU( frD, rA, d )            lfsu frD, (d)(rA);
#define LFSUX( frD, rA, rB )          lfsux frD, rA, rB;
#define LFSX( frD, rA, rB )           lfsx frD, rA, rB;
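
The LA, LBZA and LFSA macros build a 32-bit address from two 16-bit halves using the assembler's @ha and @l operators. Because the low half is sign-extended when added, @ha is the upper half adjusted upward by one whenever bit 15 of the address is set. The C sketch below shows that split and checks that it recombines to the original address; the function names are hypothetical and the code assumes 32-bit addresses.

```c
#include <stdint.h>

/* symbol@ha: high 16 bits, adjusted for the sign of the low half. */
static uint16_t addr_ha(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* symbol@l: low 16 bits interpreted as a signed displacement. */
static int32_t addr_lo(uint32_t addr)
{
    int32_t lo = (int32_t)(addr & 0xffffu);
    return (lo >= 0x8000) ? lo - 0x10000 : lo;
}

/* Mirrors "addis rD, 0, sym@ha; addi rD, rD, sym@l". */
static uint32_t rebuild_address(uint32_t addr)
{
    uint32_t high = (uint32_t)addr_ha(addr) << 16;   /* addis */
    return high + (uint32_t)addr_lo(addr);           /* addi, wraps mod 2^32 */
}
```

For 0x12348765 the halves are 0x1235 and -0x789B, and 0x12350000 - 0x789B gives back 0x12348765.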
#define LHA( rD, rA, d )              lha rD, (d)(rA);
#define LHAA( rD, symbol ) \
    addis rD, 0, (symbol)@ha; \
    lha rD, (symbol)@l(rD);
#define LHAU( rD, rA, d )             lhau rD, (d)(rA);
#define LHAUX( rD, rA, rB )           lhaux rD, rA, rB;
#define LHAX( rD, rA, rB )            lhax rD, rA, rB;
#define LHZ( rD, rA, d )              lhz rD, (d)(rA);
#define LHZA( rD, symbol ) \
    addis rD, 0, (symbol)@ha; \
    lhz rD, (symbol)@l(rD);
#define LHZU( rD, rA, d )             lhzu rD, (d)(rA);
#define LHZUX( rD, rA, rB )           lhzux rD, rA, rB;
#define LHZX( rD, rA, rB )            lhzx rD, rA, rB;
#define LI( rD, SIMM )                li rD, (SIMM);
#define LIS( rD, SIMM )               lis rD, (SIMM);
#define LOAD_COUNT( rD )              mtctr rD;
#define LWZ( rD, rA, d )              lwz rD, (d)(rA);
#define LWZA( rD, symbol ) \
    addis rD, 0, (symbol)@ha; \
    lwz rD, (symbol)@l(rD);
#define LWZU( rD, rA, d )             lwzu rD, (d)(rA);
#define LWZUX( rD, rA, rB )           lwzux rD, rA, rB;
#define LWZX( rD, rA, rB )            lwzx rD, rA, rB;
#define MCRF( crfD, crfS )            mcrf crfD, crfS;
#define MCRFS( crfD, crfS )           mcrfs crfD, crfS;
#define MFCR( rD )                    mfcr rD;
#define MFCTR( rD )                   mfctr rD;
#define MFLR( rD )                    mflr rD;
#define MFSPR( rD, SPR )              mfspr rD, SPR;
#define MR( rA, rS )                  mr rA, rS;
#define MR_C( rA, rS )                or. rA, rS, rS;
#define MOV( rA, rS )                 MR( rA, rS )
#define MOV_C( rA, rS )               MR_C( rA, rS )
#define MTCR( rD )                    mtcr rD;
#define MTCTR( rD )                   mtctr rD;
#define MTFSFI( crfD, IMM )           mtfsfi (crfD), (IMM);
#define MTLR( rD )                    mtlr rD;
#define MTSPR( SPR, rS )              mtspr SPR, rS;
#define MULLI( rD, rA, SIMM )         mulli rD, rA, (SIMM);
#define MULLW( rD, rA, rB )           mullw rD, rA, rB;
#define MULLW_C( rD, rA, rB )         mullw. rD, rA, rB;
#define NAND( rA, rS, rB )            nand rA, rS, rB;
#define NAND_C( rA, rS, rB )          nand. rA, rS, rB;
#define NEG( rD, rA )                 neg rD, rA;
#define NEG_C( rD, rA )               neg. rD, rA;
#define NOP                           nop;
#define NOR( rA, rS, rB )             nor rA, rS, rB;
#define NOR_C( rA, rS, rB )           nor. rA, rS, rB;
#define OR( rA, rS, rB )              or rA, rS, rB;
#define OR_C( rA, rS, rB )            or. rA, rS, rB;
#define ORC( rA, rS, rB )             orc rA, rS, rB;
#define ORC_C( rA, rS, rB )           orc. rA, rS, rB;
#define ORI( rA, rS, UIMM )           ori rA, rS, (UIMM);
#define ORIS( rA, rS, UIMM )          oris rA, rS, (UIMM);
#define RETURN                        BLR
#define RLWIMI( rA, rS, SH, MB, ME )  rlwimi rA, rS, SH, MB, ME;
#define RLWIMI_C( rA, rS, SH, MB, ME ) rlwimi. rA, rS, SH, MB, ME;
#define RLWINM( rA, rS, SH, MB, ME )  rlwinm rA, rS, SH, MB, ME;
#define RLWINM_C( rA, rS, SH, MB, ME ) rlwinm. rA, rS, SH, MB, ME;
#define RLWNM( rA, rS, rB, MB, ME )   rlwnm rA, rS, rB, MB, ME;
#define RLWNM_C( rA, rS, rB, MB, ME ) rlwnm. rA, rS, rB, MB, ME;
#define ROTLW( rA, rS, rB )           rlwnm rA, rS, rB, 0, 31;
#define ROTLW_C( rA, rS, rB )         rlwnm. rA, rS, rB, 0, 31;
#define ROTLWI( rA, rS, n )           rlwinm rA, rS, (n), 0, 31;
#define ROTLWI_C( rA, rS, n )         rlwinm. rA, rS, (n), 0, 31;
#define ROTRWI( rA, rS, n )           rlwinm rA, rS, 32-(n), 0, 31;
#define ROTRWI_C( rA, rS, n )         rlwinm. rA, rS, 32-(n), 0, 31;
#define SLW( rA, rS, rB )             slw rA, rS, rB;
#define SLW_C( rA, rS, rB )           slw. rA, rS, rB;
#define SLWI( rA, rS, SH )            slwi rA, rS, (SH);
#define SLWI_C( rA, rS, SH )          slwi. rA, rS, (SH);
#define SRAW( rA, rS, rB )            sraw rA, rS, rB;
#define SRAW_C( rA, rS, rB )          sraw. rA, rS, rB;
#define SRAWI( rA, rS, SH )           srawi rA, rS, (SH);
#define SRAWI_C( rA, rS, SH )         srawi. rA, rS, (SH);
#define SRW( rA, rS, rB )             srw rA, rS, rB;
#define SRW_C( rA, rS, rB )           srw. rA, rS, rB;
#define SRWI( rA, rS, SH )            srwi rA, rS, (SH);
#define SRWI_C( rA, rS, SH )          srwi. rA, rS, (SH);
#define STB( rS, rA, d )              stb rS, (d)(rA);
#define STBU( rS, rA, d )             stbu rS, (d)(rA);
#define STBUX( rS, rA, rB )           stbux rS, rA, rB;
#define STBX( rS, rA, rB )            stbx rS, rA, rB;
#define STFD( frD, rA, d )            stfd frD, (d)(rA);
#define STFDU( frD, rA, d )           stfdu frD, (d)(rA);
#define STFDUX( frD, rA, rB )         stfdux frD, rA, rB;
#define STFDX( frD, rA, rB )          stfdx frD, rA, rB;
#define STFS( frD, rA, d )            stfs frD, (d)(rA);
#define STFSU( frD, rA, d )           stfsu frD, (d)(rA);
#define STFSUX( frD, rA, rB )         stfsux frD, rA, rB;
#define STFSX( frD, rA, rB )          stfsx frD, rA, rB;
#define STH( rS, rA, d )              sth rS, (d)(rA);
#define STHU( rS, rA, d )             sthu rS, (d)(rA);
#define STHUX( rS, rA, rB )           sthux rS, rA, rB;
#define STHX( rS, rA, rB )            sthx rS, rA, rB;
#define STW( rS, rA, d )              stw rS, (d)(rA);
#define STWU( rS, rA, d )             stwu rS, (d)(rA);
#define STWUX( rS, rA, rB )           stwux rS, rA, rB;
#define STWX( rS, rA, rB )            stwx rS, rA, rB;
#define SUB( rD, rA, rB )             sub rD, rA, rB;
#define SUB_C( rD, rA, rB )           sub. rD, rA, rB;
#define SUBFIC( rD, rA, SIMM )        subfic rD, rA, (SIMM);
#define SUBI( rD, rA, SIMM )          subi rD, rA, (SIMM);
#define SUBIC_C( rD, rA, SIMM )       subic. rD, rA, (SIMM);
#define SUBIS( rD, rA, SIMM )         subis rD, rA, (SIMM);
#define TEST_COUNT( label )           bdnz label;
#define XOR( rA, rS, rB )             xor rA, rS, rB;
#define XOR_C( rA, rS, rB )           xor. rA, rS, rB;
#define XORI( rA, rS, UIMM )          xori rA, rS, (UIMM);
#define XORIS( rA, rS, UIMM )         xoris rA, rS, (UIMM);

/*
 * VMX instructions
 */
#define BR_VMX_ALL_TRUE( label )      bt 24, label;
#define BR_VMX_ALL_FALSE( label )     bt 26, label;
#define BR_VMX_NONE_TRUE( label )     bt 26, label;
#define BR_VMX_SOME_FALSE( label )    bf 24, label;
#define BR_VMX_SOME_TRUE( label )     bf 26, label;
#define DSS( STRM )                   dss STRM, 0;
#define DSSALL                        dss 0, 1;
#define DST( rA, rB, STRM )           dst rA, rB, STRM;
#define DSTST( rA, rB, STRM )         dstst rA, rB, STRM;
#define DSTT( rA, rB, STRM )          dstt rA, rB, STRM;
#define DSTSTT( rA, rB, STRM )        dststt rA, rB, STRM;
#define LVEBX( vT, rA, rB )           lvebx vT, rA, rB;
#define LVEHX( vT, rA, rB )           lvehx vT, rA, rB;
#define LVEWX( vT, rA, rB )           lvewx vT, rA, rB;
#if defined( LITTLE_ENDIAN )
#define LVSL( vT, rA, rB )            lvsr vT, rA, rB;
#define LVSR( vT, rA, rB )            lvsl vT, rA, rB;
#else
#define LVSL( vT, rA, rB )            lvsl vT, rA, rB;
#define LVSR( vT, rA, rB )            lvsr vT, rA, rB;
#endif
#define LVX( vT, rA, rB )  lvx vT, rA, rB;
#define LVXL( vT, rA, rB )  lvxl vT, rA, rB;
#define STVEBX( vS, rA, rB )  stvebx vS, rA, rB;
#define STVEHX( vS, rA, rB )  stvehx vS, rA, rB;
#define STVEWX( vS, rA, rB )  stvewx vS, rA, rB;
#define STVX( vS, rA, rB )  stvx vS, rA, rB;
#define STVXL( vS, rA, rB )  stvxl vS, rA, rB;
#define VADDFP( vT, vA, vB )  vaddfp vT, vA, vB;
#define VADDSBS( vT, vA, vB )  vaddsbs vT, vA, vB;
#define VADDSHS( vT, vA, vB )  vaddshs vT, vA, vB;
#define VADDSWS( vT, vA, vB )  vaddsws vT, vA, vB;
#define VADDUBM( vT, vA, vB )  vaddubm vT, vA, vB;
#define VADDUBS( vT, vA, vB )  vaddubs vT, vA, vB;
#define VADDUHM( vT, vA, vB )  vadduhm vT, vA, vB;
#define VADDUHS( vT, vA, vB )  vadduhs vT, vA, vB;
#define VADDUWM( vT, vA, vB )  vadduwm vT, vA, vB;
#define VADDUWS( vT, vA, vB )  vadduws vT, vA, vB;
#define VAND( vT, vA, vB )  vand vT, vA, vB;
#define VANDC( vT, vA, vB )  vandc vT, vA, vB;
#define VCMPEQFP( vT, vA, vB )  vcmpeqfp vT, vA, vB;
#define VCMPEQFP_C( vT, vA, vB )  vcmpeqfp. vT, vA, vB;
#define VCMPEQUB( vT, vA, vB )  vcmpequb vT, vA, vB;
#define VCMPEQUB_C( vT, vA, vB )  vcmpequb. vT, vA, vB;
#define VCMPEQUH( vT, vA, vB )  vcmpequh vT, vA, vB;
#define VCMPEQUH_C( vT, vA, vB )  vcmpequh. vT, vA, vB;
#define VCMPEQUW( vT, vA, vB )  vcmpequw vT, vA, vB;
#define VCMPEQUW_C( vT, vA, vB )  vcmpequw. vT, vA, vB;
#define VCMPGEFP( vT, vA, vB )  vcmpgefp vT, vA, vB;
#define VCMPGEFP_C( vT, vA, vB )  vcmpgefp. vT, vA, vB;
#define VCMPGTFP( vT, vA, vB )  vcmpgtfp vT, vA, vB;
#define VCMPGTFP_C( vT, vA, vB )  vcmpgtfp. vT, vA, vB;
#define VCMPGTSB( vT, vA, vB )  vcmpgtsb vT, vA, vB;
#define VCMPGTSB_C( vT, vA, vB )  vcmpgtsb. vT, vA, vB;
#define VCMPGTSH( vT, vA, vB )  vcmpgtsh vT, vA, vB;
#define VCMPGTSH_C( vT, vA, vB )  vcmpgtsh. vT, vA, vB;
#define VCMPGTSW( vT, vA, vB )  vcmpgtsw vT, vA, vB;
#define VCMPGTSW_C( vT, vA, vB )  vcmpgtsw. vT, vA, vB;
#define VCMPGTUB( vT, vA, vB )  vcmpgtub vT, vA, vB;
#define VCMPGTUB_C( vT, vA, vB )  vcmpgtub. vT, vA, vB;
#define VCMPGTUH( vT, vA, vB )  vcmpgtuh vT, vA, vB;
#define VCMPGTUH_C( vT, vA, vB )  vcmpgtuh. vT, vA, vB;
#define VCMPGTUW( vT, vA, vB )  vcmpgtuw vT, vA, vB;
#define VCMPGTUW_C( vT, vA, vB )  vcmpgtuw. vT, vA, vB;
#define VCFSX( vT, vB, UIMM )  vcfsx vT, vB, (UIMM);
#define VCFUX( vT, vB, UIMM )  vcfux vT, vB, (UIMM);
#define VCTSXS( vT, vB, UIMM )  vctsxs vT, vB, (UIMM);
#define VCTUXS( vT, vB, UIMM )  vctuxs vT, vB, (UIMM);
#define VEXPTEFP( vT, vB )  vexptefp vT, vB;
#define VLOGEFP( vT, vB )  vlogefp vT, vB;
#define VMADDFP( vT, vA, vC, vB )  vmaddfp vT, vA, vC, vB;
#define VMAXFP( vT, vA, vB )  vmaxfp vT, vA, vB;
#define VMAXSB( vT, vA, vB )  vmaxsb vT, vA, vB;
#define VMAXSH( vT, vA, vB )  vmaxsh vT, vA, vB;
#define VMAXSW( vT, vA, vB )  vmaxsw vT, vA, vB;
#define VMAXUB( vT, vA, vB )  vmaxub vT, vA, vB;
#define VMAXUH( vT, vA, vB )  vmaxuh vT, vA, vB;
#define VMAXUW( vT, vA, vB )  vmaxuw vT, vA, vB;
#define VMHADDSHS( vD, vA, vB, vC )  vmhaddshs vD, vA, vB, vC;
#define VMHRADDSHS( vD, vA, vB, vC )  vmhraddshs vD, vA, vB, vC;
#define VMINFP( vT, vA, vB )  vminfp vT, vA, vB;
#define VMINSB( vT, vA, vB )  vminsb vT, vA, vB;
#define VMINSH( vT, vA, vB )  vminsh vT, vA, vB;
#define VMINSW( vT, vA, vB )  vminsw vT, vA, vB;
#define VMINUB( vT, vA, vB )  vminub vT, vA, vB;
#define VMINUH( vT, vA, vB )  vminuh vT, vA, vB;
#define VMINUW( vT, vA, vB )  vminuw vT, vA, vB;
#define VMLADDUHM( vD, vA, vB, vC )  vmladduhm vD, vA, vB, vC;
#define VMR( vD, vS )  vor vD, vS, vS;
#if defined( LITTLE_ENDIAN )
#define VMRGHB( vT, vA, vB )  vmrglb vT, vB, vA;
#define VMRGHH( vT, vA, vB )  vmrglh vT, vB, vA;
#define VMRGHW( vT, vA, vB )  vmrglw vT, vB, vA;
#define VMRGLB( vT, vA, vB )  vmrghb vT, vB, vA;
#define VMRGLH( vT, vA, vB )  vmrghh vT, vB, vA;
#define VMRGLW( vT, vA, vB )  vmrghw vT, vB, vA;
#else
#define VMRGHB( vT, vA, vB )  vmrghb vT, vA, vB;
#define VMRGHH( vT, vA, vB )  vmrghh vT, vA, vB;
#define VMRGHW( vT, vA, vB )  vmrghw vT, vA, vB;
#define VMRGLB( vT, vA, vB )  vmrglb vT, vA, vB;
#define VMRGLH( vT, vA, vB )  vmrglh vT, vA, vB;
#define VMRGLW( vT, vA, vB )  vmrglw vT, vA, vB;
#endif
#define VMSUMMBM( vT, vA, vB, vC )  vmsummbm vT, vA, vB, vC;
#define VMSUMSHM( vT, vA, vB, vC )  vmsumshm vT, vA, vB, vC;
#define VMSUMSHS( vT, vA, vB, vC )  vmsumshs vT, vA, vB, vC;
#define VMSUMUBM( vT, vA, vB, vC )  vmsumubm vT, vA, vB, vC;
#define VMSUMUHM( vT, vA, vB, vC )  vmsumuhm vT, vA, vB, vC;
#define VMSUMUHS( vT, vA, vB, vC )  vmsumuhs vT, vA, vB, vC;
#define VMULESB( vT, vA, vB )  vmulesb vT, vA, vB;
#define VMULESH( vT, vA, vB )  vmulesh vT, vA, vB;
#define VMULEUB( vT, vA, vB )  vmuleub vT, vA, vB;
#define VMULEUH( vT, vA, vB )  vmuleuh vT, vA, vB;
#define VMULOSB( vT, vA, vB )  vmulosb vT, vA, vB;
#define VMULOSH( vT, vA, vB )  vmulosh vT, vA, vB;
#define VMULOUB( vT, vA, vB )  vmuloub vT, vA, vB;
#define VMULOUH( vT, vA, vB )  vmulouh vT, vA, vB;
#define VNMSUBFP( vT, vA, vC, vB )  vnmsubfp vT, vA, vC, vB;
#define VNOR( vT, vA, vB )  vnor vT, vA, vB;
#define VNOT( vT, vA )  vnor vT, vA, vA;
#define VOR( vT, vA, vB )  vor vT, vA, vB;
#if defined( LITTLE_ENDIAN )
#define VPERM( vT, vA, vB, vC )  vperm vT, vB, vA, vC;
#define VPKUHUM( vT, vA, vB )  vpkuhum vT, vB, vA;
#define VPKUHUS( vT, vA, vB )  vpkuhus vT, vB, vA;
#define VPKSHUS( vT, vA, vB )  vpkshus vT, vB, vA;
#define VPKSHSS( vT, vA, vB )  vpkshss vT, vB, vA;
#define VPKUWUM( vT, vA, vB )  vpkuwum vT, vB, vA;
#define VPKUWUS( vT, vA, vB )  vpkuwus vT, vB, vA;
#define VPKSWUS( vT, vA, vB )  vpkswus vT, vB, vA;
#define VPKSWSS( vT, vA, vB )  vpkswss vT, vB, vA;
#else
#define VPERM( vT, vA, vB, vC )  vperm vT, vA, vB, vC;
#define VPKUHUM( vT, vA, vB )  vpkuhum vT, vA, vB;
#define VPKUHUS( vT, vA, vB )  vpkuhus vT, vA, vB;
#define VPKSHUS( vT, vA, vB )  vpkshus vT, vA, vB;
#define VPKSHSS( vT, vA, vB )  vpkshss vT, vA, vB;
#define VPKUWUM( vT, vA, vB )  vpkuwum vT, vA, vB;
#define VPKUWUS( vT, vA, vB )  vpkuwus vT, vA, vB;
#define VPKSWUS( vT, vA, vB )  vpkswus vT, vA, vB;
#define VPKSWSS( vT, vA, vB )  vpkswss vT, vA, vB;
#endif
#define VREFP( vT, vB )  vrefp vT, vB;
#define VRFIM( vT, vB )  vrfim vT, vB;
#define VRFIN( vT, vB )  vrfin vT, vB;
#define VRFIP( vT, vB )  vrfip vT, vB;
#define VRFIZ( vT, vB )  vrfiz vT, vB;
#define VRLB( vT, vA, vB )  vrlb vT, vA, vB;
#define VRLH( vT, vA, vB )  vrlh vT, vA, vB;
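/*
 * Illustrative sketch, not part of the original listing. One way to read
 * the LITTLE_ENDIAN branch of the merge macros (VMRGHB expanding to
 * "vmrglb vT, vB, vA") is that it performs a merge-high in mirrored
 * element numbering. The C model below checks that reading for bytes.
 * The function names are hypothetical; vmrghb/vmrglb follow the
 * architected big-endian definitions of those instructions.
 */

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Architected vmrghb: interleave bytes 0..7 of a and b (a first). */
static void vmrghb(uint8_t d[16], const uint8_t a[16], const uint8_t b[16])
{
    uint8_t t[16];
    for (int i = 0; i < 8; i++) {
        t[2 * i] = a[i];
        t[2 * i + 1] = b[i];
    }
    memcpy(d, t, 16);
}

/* Architected vmrglb: interleave bytes 8..15 of a and b (a first). */
static void vmrglb(uint8_t d[16], const uint8_t a[16], const uint8_t b[16])
{
    uint8_t t[16];
    for (int i = 0; i < 8; i++) {
        t[2 * i] = a[8 + i];
        t[2 * i + 1] = b[8 + i];
    }
    memcpy(d, t, 16);
}

/* Mirror the byte numbering of a 16-byte vector (safe in place). */
static void reverse16(uint8_t d[16], const uint8_t s[16])
{
    uint8_t t[16];
    for (int i = 0; i < 16; i++)
        t[i] = s[15 - i];
    memcpy(d, t, 16);
}

/* Returns 1 when "vmrglb vT, vB, vA", viewed with mirrored numbering,
 * equals vmrghb applied to the mirrored inputs. */
int le_merge_high_matches(void)
{
    uint8_t a[16], b[16], ra[16], rb[16], lhs[16], rhs[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)i;
        b[i] = (uint8_t)(0x80 + i);
    }
    vmrglb(lhs, b, a);      /* what the LITTLE_ENDIAN VMRGHB emits */
    reverse16(lhs, lhs);    /* view the result in mirrored numbering */
    reverse16(ra, a);
    reverse16(rb, b);
    vmrghb(rhs, ra, rb);    /* merge-high on mirrored operands */
    return memcmp(lhs, rhs, 16) == 0;
}
```

/*
 * The same argument would apply to the halfword and word variants with
 * the element width changed; that extension is not checked here.
 */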
#define VRLW( vT, vA, vB )
#define VRSQRTEFP( vT, vB )
#define VSEL( vT, vA, vB, vC )
#define VSL( vT, vA, vB )
#if defined( LITTLE_ENDIAN )
#define VSLDOI( vT, vA, vB, UIMM )
#else
#define VSLDOI( vT, vA, vB, UIMM )
#endif



#define 
#def ine 
#def ine 
#define 
#define 
#def ine 
#def ine 
#def ine 
#def ine 
#def ine 
#define 
ttdefine 
#define 
ttdefine 
#def ine 
#def ine 
#define 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 
ttdefine 



VSLB( vT, vA, vB ) 
VSLH( vT, vA, vB ) 
VSLO( vT, vA, vB ) 
VSLW( vT, vA, vB ) 
VSR( vT, vA, vB ) 
VSRAB( vT, vA, vB ) 
VSRAH( vT, vA, vB ) 
VSRAW( vT, vA, vB ) 
VSRB( vT, vA, vB ) 
VSRH( vT, vA, vB ) 
VSRO( vT, vA, vB ) 
VSRW( vT, vA, vB ) 
VSPLTB( vT, vB, UIMM ) 
VSPLTH( vT, vB, UIMM ) 
VSPLTW( vT, vB, UIMM ) 
VSPLTISB{ vT, SIMM ) 
VSPLTISHt vT, SIMM ) 
VSPLTISW( vT, SIMM ) 
VSUBFP{ vT, vA, vB ) 



vA, 



vB ) 
vB ) 
vB ) 
vB ) 
vB ) 
vB ) 



VSUBSBS( vT, 
VSUBSHS( vT, 
VSUBSWS( vT, 
VSUBUBM( vT, 
VSUBUBS( vT, 
VSUBUHM( vT, 
VSUBUHS ( vT , 
VSUBUWM( vT, 
VSUBUWS ( vT, 
VSUMSWS ( vT, 
VSUM2SWS( vT, vA, vB ) 
VSUM4SBS{ vT, vA, vB ) 
VSUM4SHS( vT, vA, vB ) 
VSUM4UBS( vT, vA, vB ) 



vB ) 
vB ) 
vB ) 



ttif defined ( LITTLE ENDIAN ) 
ttdefi 
ttdefi 



vrlw vT, vA, vB; 
vrsqrtefp vT, vB; 
vsel vT, vA, vB, vC; 
vsl vT, vA, vB; 



vsldoi vT, vB, vA, (16 - (UIMM)); 
vsldoi vT, vA, vB, (UIMM) ; 



vslb vT, vA, vB;
vslh vT, vA, vB;
vslo vT, vA, vB;
vslw vT, vA, vB;
vsr vT, vA, vB;
vsrab vT, vA, vB;
vsrah vT, vA, vB;
vsraw vT, vA, vB;
vsrb vT, vA, vB;
vsrh vT, vA, vB;
vsro vT, vA, vB;
vsrw vT, vA, vB;
vspltb vT, vB, ( 
vsplth vT, vB 



INDEX MUNGE ( UIMM ) ; 
INDEX MUNGE ( UIMM ) ; 
vspltw vT, vB, L INDEX_MUNGE( UIMM ) ; 
vspltisb vT, (SIMM) ; 
vspltish vT, (SIMM) ; 
vspltisw vT, (SIMM) ; 
vsubfp vT, vA, vB; 
vsubsbs vT, vA, vB;
vsubshs vT, vA, vB;
vsubsws vT, vA, vB;
vsububm vT, vA, vB;
vsububs vT, vA, vB;
vsubuhm vT, vA, vB;
vsubuhs vT, vA, vB;
vsubuwm vT, vA, vB;
vsubuws vT, vA, vB;
vsumsws vT, vA, vB;
vsum2sws vT, vA, vB;
vsum4sbs vT, vA, vB;
vsum4shs vT, vA, vB;
vsum4ubs vT, vA, vB;



#define VUPKHSB( vT, vB )  vupklsb vT, vB;
#define VUPKHSH( vT, vB )  vupklsh vT, vB;
#define VUPKLSB( vT, vB )  vupkhsb vT, vB;
#define VUPKLSH( vT, vB )  vupkhsh vT, vB;
#else
#define VUPKHSB( vT, vB )  vupkhsb vT, vB;
#define VUPKHSH( vT, vB )  vupkhsh vT, vB;
#define VUPKLSB( vT, vB )  vupklsb vT, vB;
#define VUPKLSH( vT, vB )  vupklsh vT, vB;
#endif
#define VXOR( vT, vA, vB )  vxor vT, vA, vB;



/*
 * stack and register macros
 */
#define VRSAVE_COND 7  /* recommended VR condition bit */
#undef VOLATILE_r13  /* r13 volatile or non-volatile */
#define MIN_STACK_ALIGN 16
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1)
#define ALIGN_STACK( nbytes ) \
    (((nbytes) + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK)

#define LR_SAVE_OFF 4
#define FPR_SAVE_OFF (-(32-14)*8)
#if defined( VOLATILE_r13 )
#define GPR_SAVE_OFF (FPR_SAVE_OFF - (32-14)*4)
#else
#define GPR_SAVE_OFF (FPR_SAVE_OFF - (32-13)*4)
#endif
#define CR_SAVE_OFF (GPR_SAVE_OFF - 4)
#if defined( BUILD_MAX )
#define VRSAVE_SAVE_OFF (CR_SAVE_OFF - 4)
#if defined( VOLATILE_r13 )
#define ALIGNMENT_PADDING_OFF (VRSAVE_SAVE_OFF - 0)
#else
#define ALIGNMENT_PADDING_OFF (VRSAVE_SAVE_OFF - 12)
#endif
#define VR_SAVE_OFF (ALIGNMENT_PADDING_OFF - (32-20)*16)
#define LAST_OFF VR_SAVE_OFF
#else
#define LAST_OFF CR_SAVE_OFF
#endif

#define REG_SAVE_SIZE (-LAST_OFF)
#define MAX_NARGS 18
#define ARGS_SIZE (MAX_NARGS * 4)
#define LINK_SIZE 8
#define STACK_FRAME_SIZE (REG_SAVE_SIZE + ARGS_SIZE + LINK_SIZE)
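/*
 * Illustrative sketch, not part of the original listing. The offset
 * macros above form a chain of negative offsets below the stack pointer.
 * The C function below recomputes that chain so the resulting sizes can
 * be checked by hand. layout_t, stack_layout and its two flag parameters
 * are hypothetical stand-ins for the compile-time switches VOLATILE_r13
 * and BUILD_MAX.
 */

```c
#include <assert.h>

typedef struct {
    int fpr;            /* FPR_SAVE_OFF */
    int gpr;            /* GPR_SAVE_OFF */
    int cr;             /* CR_SAVE_OFF */
    int vrsave;         /* VRSAVE_SAVE_OFF (0 when build_max is 0) */
    int padding;        /* ALIGNMENT_PADDING_OFF (0 when build_max is 0) */
    int vr;             /* VR_SAVE_OFF (0 when build_max is 0) */
    int reg_save_size;  /* REG_SAVE_SIZE */
    int frame_size;     /* STACK_FRAME_SIZE */
} layout_t;

layout_t stack_layout(int volatile_r13, int build_max)
{
    layout_t l;
    int last;

    l.fpr = -(32 - 14) * 8;                               /* f14..f31 */
    l.gpr = l.fpr - (32 - (volatile_r13 ? 14 : 13)) * 4;  /* r13/r14..r31 */
    l.cr = l.gpr - 4;
    if (build_max) {
        l.vrsave = l.cr - 4;
        l.padding = l.vrsave - (volatile_r13 ? 0 : 12);   /* 16-byte align */
        l.vr = l.padding - (32 - 20) * 16;                /* v20..v31 */
        last = l.vr;
    } else {
        l.vrsave = l.padding = l.vr = 0;
        last = l.cr;
    }
    l.reg_save_size = -last;
    l.frame_size = l.reg_save_size + 18 * 4 + 8;          /* ARGS + LINK */
    return l;
}
```

/*
 * With r13 non-volatile and BUILD_MAX defined this gives a register save
 * size of 432 bytes and a frame size of 512 bytes, the two constants that
 * appear in the size limits quoted in the stack macro comments.
 */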

/* macros to obtain the byte offset into the stack for the last FPR
 * and GPR registers for small temporary storage.
 * FPR_SAVE_AREA_OFFSET points to an area of 8 * (# of unsaved non-volatile
 * FPR registers).
 * GPR_SAVE_AREA_OFFSET points to an area of 4 * (# of unsaved non-volatile
 * GPR registers).
 * GET_FPR_SAVE_AREA places the start of the FPR save area into a register
 * GET_GPR_SAVE_AREA places the start of the GPR save area into a register
 * For MAX only:
 * VR_SAVE_AREA_OFFSET points to an area of 16 * (# of unsaved non-volatile
 * VR registers).
 * GET_VR_SAVE_AREA places the start of the VR save area into a register
 */
#define FPR_SAVE_AREA_OFFSET FPR_SAVE_OFF
#define GPR_SAVE_AREA_OFFSET GPR_SAVE_OFF

#define GET_FPR_SAVE_AREA( ptr ) \
    addi ptr, sp, FPR_SAVE_AREA_OFFSET;

#define GET_GPR_SAVE_AREA( ptr ) \
    addi ptr, sp, GPR_SAVE_AREA_OFFSET;

#if defined( BUILD_MAX )
#define VR_SAVE_AREA_OFFSET VR_SAVE_OFF

#define GET_VR_SAVE_AREA( ptr ) \
    addi ptr, sp, VR_SAVE_AREA_OFFSET;
#endif

/* if the function creates a stack frame with local storage,
 * LOCAL_STORAGE_OFFSET is the stack offset to the start of this
 * storage and is guaranteed to have the minimum stack alignment.
 */
#define LOCAL_STORAGE_OFFSET (LINK_SIZE + ARGS_SIZE)

/*
 * CREATE_STACK_FRAME[_X] creates a stack frame that can handle up to
 * 18 GPR register arguments and a local storage size <=
 * 32768 - 512 = 32,256 bytes.
 * CREATE_STACK_FRAME_X destroys r0.
 * For CREATE_STACK_FRAME_X, local_nbytes_reg must not be r0.
 * Both CREATE_STACK_FRAME[_X] and DESTROY_STACK_FRAME should not be
 * called before registers are saved or after they are restored.
 * The stack pointer "output from" CREATE_STACK_FRAME[_X] must be
 * the same "input to" DESTROY_STACK_FRAME.
 */
#define CREATE_STACK_FRAME( local_nbytes ) \
    stwu sp, -ALIGN_STACK( STACK_FRAME_SIZE + (local_nbytes) )(sp);

#define CREATE_STACK_FRAME_X( local_nbytes_reg ) \
    addi r0, local_nbytes_reg, (STACK_FRAME_SIZE + MIN_STACK_ALIGN_SIZE); \
    andi. r0, r0, ~MIN_STACK_ALIGN_MASK; \
    stwux sp, sp, r0;

#define DESTROY_STACK_FRAME \
    lwz sp, 0(sp);

/* macros to allocate and free space on the user stack
 * with a fixed alignment of MIN_STACK_ALIGN.
 * nbytes must be <= (32768 - 432 = 32,336).
 * On return, sp points to a buffer of nbytes bytes.
 */
#define PUSH_STACK( nbytes ) \
    addi sp, sp, -ALIGN_STACK( REG_SAVE_SIZE + (nbytes) );

#define POP_STACK( nbytes ) \
    addi sp, sp, ALIGN_STACK( REG_SAVE_SIZE + (nbytes) );

#define ALLOCATE_STACK_SPACE( ptr, nbytes ) \
    PUSH_STACK( nbytes ) \
    mr ptr, sp;

#define FREE_STACK_SPACE( nbytes ) POP_STACK( nbytes )
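/*
 * Illustrative sketch, not part of the original listing. It models in C
 * what CREATE_STACK_FRAME and DESTROY_STACK_FRAME do to the stack
 * pointer: round the frame size up with ALIGN_STACK, store the old
 * stack pointer at the new one (the stwu back chain), and later reload
 * it with a single load. The toy memory array, the helper names and the
 * fixed STACK_FRAME_SIZE of 512 (BUILD_MAX with non-volatile r13) are
 * assumptions made for the example.
 */

```c
#include <assert.h>
#include <stdint.h>

#define MIN_STACK_ALIGN 16
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1)
#define ALIGN_STACK(nbytes) \
    (((nbytes) + MIN_STACK_ALIGN_MASK) & ~MIN_STACK_ALIGN_MASK)
#define STACK_FRAME_SIZE 512   /* assumed configuration, see lead-in */

static uint32_t memory[16384]; /* toy memory: 64 KiB of 32-bit words */

static uint32_t load_word(uint32_t addr) { return memory[addr / 4]; }
static void store_word(uint32_t addr, uint32_t v) { memory[addr / 4] = v; }

/* Model of: stwu sp, -ALIGN_STACK(STACK_FRAME_SIZE + local_nbytes)(sp) */
uint32_t create_stack_frame(uint32_t sp, uint32_t local_nbytes)
{
    uint32_t new_sp = sp - ALIGN_STACK(STACK_FRAME_SIZE + local_nbytes);
    store_word(new_sp, sp);    /* back chain at offset 0 of the new frame */
    return new_sp;
}

/* Model of: lwz sp, 0(sp) */
uint32_t destroy_stack_frame(uint32_t sp)
{
    return load_word(sp);
}
```

/*
 * For example, starting from sp = 0xF000 with 100 bytes of local
 * storage, the frame is ALIGN_STACK(612) = 624 bytes and
 * destroy_stack_frame returns the original 0xF000.
 */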

/*
 * macros to create and destroy a stack buffer with a variable
 * alignment and size.
 * CREATE_STACK_BUFFER[_X] creates a buffer of size nbytes and alignment
 * byte_align on the stack, returning a pointer to the buffer in the
 * GPR bufferp.
 * bufferp must be a GPR other than r0 and r1 (sp).
 * byte_align must be a power of 2 such that 2 <= byte_align <= 4096.
 * CREATE_STACK_BUFFER destroys r0.
 * CREATE_STACK_BUFFER[_X] stores the original value of the stack pointer
 * below the buffer at offset 0 from the new stack pointer.
 * DESTROY_STACK_BUFFER sets the stack pointer to the value stored
 * at the address pointed to by the input stack pointer.
 * Both CREATE_STACK_BUFFER[_X] and DESTROY_STACK_BUFFER should not be
 * called before registers are saved or after they are restored.
 * The stack pointer "output from" CREATE_STACK_BUFFER[_X] must be
 * the same "input to" DESTROY_STACK_BUFFER.
 */
#define CREATE_STACK_BUFFER( bufferp, byte_align, nbytes ) \
    addis bufferp, sp, (-(REG_SAVE_SIZE + (nbytes)) + 32768)@h; \
    li r0, (((byte_align) - 1) | MIN_STACK_ALIGN_MASK); \
    addi bufferp, bufferp, (-(REG_SAVE_SIZE + (nbytes)))@l; \
    andc bufferp, bufferp, r0; \
    sub r0, bufferp, sp; \
    addic r0, r0, -MIN_STACK_ALIGN; \
    stwux sp, sp, r0;

#define CREATE_STACK_BUFFER_X( bufferp, byte_align, nbytes_reg ) \
    sub bufferp, sp, nbytes_reg; \
    li r0, (((byte_align) - 1) | MIN_STACK_ALIGN_MASK); \
    addi bufferp, bufferp, -REG_SAVE_SIZE; \
    andc bufferp, bufferp, r0; \
    sub r0, bufferp, sp; \
    addic r0, r0, -MIN_STACK_ALIGN; \
    stwux sp, sp, r0;

#define DESTROY_STACK_BUFFER \
    lwz sp, 0(sp);
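/*
 * Illustrative sketch, not part of the original listing. It follows the
 * instruction sequence of CREATE_STACK_BUFFER step by step in C: step
 * below the register save area by nbytes, clear the low address bits to
 * reach the requested alignment, then move the stack pointer
 * MIN_STACK_ALIGN bytes below the buffer and store the old stack
 * pointer there. The toy memory, the struct and the function names are
 * hypothetical, and REG_SAVE_SIZE is fixed at 432 (BUILD_MAX with
 * non-volatile r13) as an assumption.
 */

```c
#include <assert.h>
#include <stdint.h>

#define MIN_STACK_ALIGN 16
#define MIN_STACK_ALIGN_MASK (MIN_STACK_ALIGN - 1)
#define REG_SAVE_SIZE 432      /* assumed configuration, see lead-in */

static uint32_t memory[16384]; /* toy memory: 64 KiB of 32-bit words */

static uint32_t load_word(uint32_t addr) { return memory[addr / 4]; }
static void store_word(uint32_t addr, uint32_t v) { memory[addr / 4] = v; }

typedef struct {
    uint32_t sp;        /* stack pointer after the macro */
    uint32_t bufferp;   /* aligned buffer address */
} stack_buffer_t;

stack_buffer_t create_stack_buffer(uint32_t sp, uint32_t byte_align,
                                   uint32_t nbytes)
{
    stack_buffer_t r;
    /* li r0, ((byte_align - 1) | MIN_STACK_ALIGN_MASK) */
    uint32_t mask = (byte_align - 1) | MIN_STACK_ALIGN_MASK;
    /* addis/addi pair: bufferp = sp - (REG_SAVE_SIZE + nbytes) */
    uint32_t bufferp = sp - (REG_SAVE_SIZE + nbytes);
    /* andc bufferp, bufferp, r0 */
    bufferp &= ~mask;
    /* sub r0, bufferp, sp; addic r0, r0, -MIN_STACK_ALIGN */
    uint32_t delta = bufferp - sp - MIN_STACK_ALIGN;
    /* stwux sp, sp, r0: store old sp at sp + r0, then update sp */
    store_word(sp + delta, sp);
    r.sp = sp + delta;
    r.bufferp = bufferp;
    return r;
}

/* Model of DESTROY_STACK_BUFFER: lwz sp, 0(sp) */
uint32_t destroy_stack_buffer(uint32_t sp)
{
    return load_word(sp);
}
```

/*
 * With sp = 0xF000, byte_align = 256 and nbytes = 1000 the buffer lands
 * at 59904 (a multiple of 256), the new stack pointer sits 16 bytes
 * below it, and destroy_stack_buffer restores 0xF000.
 */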

/*
 * macros to create and destroy the salcache buffer on the user stack.
 * CREATE_STACK_SALCACHE destroys r0.
 * Both CREATE_STACK_SALCACHE and DESTROY_STACK_SALCACHE should not be
 * called before registers are saved or after they are restored.
 */
#define CREATE_STACK_SALCACHE( cachep ) \
    CREATE_STACK_BUFFER( cachep, SALCACHE_ALIGN, SALCACHE_ALLOC_SIZE )

#define DESTROY_STACK_SALCACHE DESTROY_STACK_BUFFER

/*
 * macros for saving and restoring non-volatile
 * floating point registers (FPRs)
 */
#define SAVE_f14_f27 SR_f14_f27( stfd )
#define SAVE_f14_f28 SR_f14_f28( stfd )
#define SAVE_f14_f29 SR_f14_f29( stfd )
#define SAVE_f14_f30 SR_f14_f30( stfd )
#define SAVE_f14_f31 SR_f14_f31( stfd )

#define SAVE_d14 SR_f14( stfd )
#define SAVE_d14_d15 SR_f14_f15( stfd )
#define SAVE_d14_d16 SR_f14_f16( stfd )
#define SAVE_d14_d17 SR_f14_f17( stfd )
#define SAVE_d14_d18 SR_f14_f18( stfd )
#define SAVE_d14_d19 SR_f14_f19( stfd )
#define SAVE_d14_d20 SR_f14_f20( stfd )
#define SAVE_d14_d21 SR_f14_f21( stfd )
#define SAVE_d14_d22 SR_f14_f22( stfd )
#define SAVE_d14_d23 SR_f14_f23( stfd )
#define SAVE_d14_d24 SR_f14_f24( stfd )
#define SAVE_d14_d25 SR_f14_f25( stfd )
#define SAVE_d14_d26 SR_f14_f26( stfd )
#define SAVE_d14_d27 SR_f14_f27( stfd )
#define SAVE_d14_d28 SR_f14_f28( stfd )
#define SAVE_d14_d29 SR_f14_f29( stfd )
#define SAVE_d14_d30 SR_f14_f30( stfd )
#define SAVE_d14_d31 SR_f14_f31( stfd )

#define REST_f14 SR_f14( lfd )
#define REST_f14_f15 SR_f14_f15( lfd )
#define REST_f14_f16 SR_f14_f16( lfd )
#define REST_f14_f17 SR_f14_f17( lfd )
#define REST_f14_f18 SR_f14_f18( lfd )
#define REST_f14_f19 SR_f14_f19( lfd )
#define REST_f14_f20 SR_f14_f20( lfd )
#define REST_f14_f21 SR_f14_f21( lfd )
#define REST_f14_f22 SR_f14_f22( lfd )
#define REST_f14_f23 SR_f14_f23( lfd )
#define REST_f14_f24 SR_f14_f24( lfd )
#define REST_f14_f25 SR_f14_f25( lfd )
#define REST_f14_f26 SR_f14_f26( lfd )
#define REST_f14_f27 SR_f14_f27( lfd )
#define REST_f14_f28 SR_f14_f28( lfd )
#define REST_f14_f29 SR_f14_f29( lfd )
#define REST_f14_f30 SR_f14_f30( lfd )
#define REST_f14_f31 SR_f14_f31( lfd )

#define REST_d14 SR_f14( lfd )
#define REST_d14_d15 SR_f14_f15( lfd )
#define REST_d14_d16 SR_f14_f16( lfd )
#define REST_d14_d17 SR_f14_f17( lfd )
#define REST_d14_d18 SR_f14_f18( lfd )
#define REST_d14_d19 SR_f14_f19( lfd )
#define REST_d14_d20 SR_f14_f20( lfd )
#define REST_d14_d21 SR_f14_f21( lfd )
#define REST_d14_d22 SR_f14_f22( lfd )
#define REST_d14_d23 SR_f14_f23( lfd )
#define REST_d14_d24 SR_f14_f24( lfd )
#define REST_d14_d25 SR_f14_f25( lfd )
#define REST_d14_d26 SR_f14_f26( lfd )
#define REST_d14_d27 SR_f14_f27( lfd )
#define REST_d14_d28 SR_f14_f28( lfd )
#define REST_d14_d29 SR_f14_f29( lfd )
#define REST_d14_d30 SR_f14_f30( lfd )
#define REST_d14_d31 SR_f14_f31( lfd )
/*
 * macros common to both FPR save and restore
 */
#define SR_f14( opcode ) \
    opcode f14, (FPR_SAVE_OFF + 17*8)(sp);
#define SR_f14_f15( opcode ) \
    opcode f15, (FPR_SAVE_OFF + 16*8)(sp); \
    SR_f14( opcode )
#define SR_f14_f16( opcode ) \
    opcode f16, (FPR_SAVE_OFF + 15*8)(sp); \
    SR_f14_f15( opcode )
#define SR_f14_f17( opcode ) \
    opcode f17, (FPR_SAVE_OFF + 14*8)(sp); \
    SR_f14_f16( opcode )
#define SR_f14_f18( opcode ) \
    opcode f18, (FPR_SAVE_OFF + 13*8)(sp); \
    SR_f14_f17( opcode )
#define SR_f14_f19( opcode ) \
    opcode f19, (FPR_SAVE_OFF + 12*8)(sp); \
    SR_f14_f18( opcode )
#define SR_f14_f20( opcode ) \
    opcode f20, (FPR_SAVE_OFF + 11*8)(sp); \
    SR_f14_f19( opcode )
#define SR_f14_f21( opcode ) \
    opcode f21, (FPR_SAVE_OFF + 10*8)(sp); \
    SR_f14_f20( opcode )
#define SR_f14_f22( opcode ) \
    opcode f22, (FPR_SAVE_OFF + 9*8)(sp); \
    SR_f14_f21( opcode )
#define SR_f14_f23( opcode ) \
    opcode f23, (FPR_SAVE_OFF + 8*8)(sp); \
    SR_f14_f22( opcode )
#define SR_f14_f24( opcode ) \
    opcode f24, (FPR_SAVE_OFF + 7*8)(sp); \
    SR_f14_f23( opcode )
#define SR_f14_f25( opcode ) \
    opcode f25, (FPR_SAVE_OFF + 6*8)(sp); \
    SR_f14_f24( opcode )
#define SR_f14_f26( opcode ) \
    opcode f26, (FPR_SAVE_OFF + 5*8)(sp); \
    SR_f14_f25( opcode )
#define SR_f14_f27( opcode ) \
    opcode f27, (FPR_SAVE_OFF + 4*8)(sp); \
    SR_f14_f26( opcode )
#define SR_f14_f28( opcode ) \
    opcode f28, (FPR_SAVE_OFF + 3*8)(sp); \
    SR_f14_f27( opcode )
#define SR_f14_f29( opcode ) \
    opcode f29, (FPR_SAVE_OFF + 2*8)(sp); \
    SR_f14_f28( opcode )
#define SR_f14_f30( opcode ) \
    opcode f30, (FPR_SAVE_OFF + 1*8)(sp); \
    SR_f14_f29( opcode )
#define SR_f14_f31( opcode ) \
    opcode f31, (FPR_SAVE_OFF)(sp); \
    SR_f14_f30( opcode )
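/*
 * Illustrative sketch, not part of the original listing. The SR_f14_fNN
 * macros form a chain: each one emits a single opcode for its highest
 * register and then expands the next shorter macro, so one macro family
 * serves both save (stfd) and restore (lfd). The C below imitates that
 * pattern for three registers, with a hypothetical RECORD macro standing
 * in for the opcode so the expansion order and offsets can be inspected.
 */

```c
#include <assert.h>

#define FPR_SAVE_OFF (-(32-14)*8)

static int saved_reg[18];   /* register numbers in expansion order */
static int saved_off[18];   /* matching byte offsets from sp */
static int saved_count;

/* Hypothetical stand-in for stfd/lfd: records (register, offset). */
#define RECORD(reg, off) \
    (saved_reg[saved_count] = (reg), saved_off[saved_count] = (off), \
     saved_count++)

/* Same chaining shape as the listing, with numeric register ids. */
#define SR_f14(op)      op(14, FPR_SAVE_OFF + 17*8);
#define SR_f14_f15(op)  op(15, FPR_SAVE_OFF + 16*8); SR_f14(op)
#define SR_f14_f16(op)  op(16, FPR_SAVE_OFF + 15*8); SR_f14_f15(op)

/* Expands the three-register chain and returns how many ops it emitted. */
int expand_f14_f16(void)
{
    saved_count = 0;
    SR_f14_f16(RECORD)
    return saved_count;
}
```

/*
 * The chain emits f16 first and f14 last, and register fN lands at
 * FPR_SAVE_OFF + (31 - N) * 8, so f14 sits 8 bytes below the incoming
 * stack pointer.
 */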

/*
 * macros for saving and restoring non-volatile
 * general purpose registers (GPRs)
 */
#if defined( VOLATILE_r13 )
#define SAVE_r13
#define SAVE_r13_r14 SR_r14( stw )
#define SAVE_r13_r15 SR_r14_r15( stw )
#define SAVE_r13_r16 SR_r14_r16( stw )
#define SAVE_r13_r17 SR_r14_r17( stw )
#define SAVE_r13_r18 SR_r14_r18( stw )
#define SAVE_r13_r19 SR_r14_r19( stw )
#define SAVE_r13_r20 SR_r14_r20( stw )
#define REST_r13_r26 SR_r13_r26( lwz )
#define REST_r13_r27 SR_r13_r27( lwz )
#define REST_r13_r28 SR_r13_r28( lwz )
#define REST_r13_r29 SR_r13_r29( lwz )
#define REST_r13_r30 SR_r13_r30( lwz )
#define REST_r13_r31 SR_r13_r31( lwz )
/*
 * macros common to both GPR save and restore
 */
#define SR_r13( opcode ) \
    opcode r13, (GPR_SAVE_OFF + 18*4)(sp);
#define SR_r13_r14( opcode ) \
    opcode r14, (GPR_SAVE_OFF + 17*4)(sp); \
    SR_r13( opcode )
#define SR_r13_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp); \
    SR_r13_r14( opcode )
#define SR_r13_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r13_r15( opcode )
#define SR_r13_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r13_r16( opcode )
#define SR_r13_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r13_r17( opcode )
#define SR_r13_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r13_r18( opcode )
#define SR_r13_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r13_r19( opcode )
#define SR_r13_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r13_r20( opcode )
#define SR_r13_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r13_r21( opcode )
#define SR_r13_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r13_r22( opcode )
#define SR_r13_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r13_r23( opcode )
#define SR_r13_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r13_r24( opcode )
#define SR_r13_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r13_r25( opcode )
#define SR_r13_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r13_r26( opcode )
#define SR_r13_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r13_r27( opcode )
#define SR_r13_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r13_r28( opcode )
#define SR_r13_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r13_r29( opcode )
#define SR_r13_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r13_r30( opcode )
#endif /* end VOLATILE_r13 */

/*
 * macros common to both GPR save and restore
 */

#define SR_r14( opcode ) \
    opcode r14, (GPR_SAVE_OFF + 17*4)(sp);
#define SR_r14_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp); \
    SR_r14( opcode )
#define SR_r14_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r14_r15( opcode )
#define SR_r14_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r14_r16( opcode )
#define SR_r14_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r14_r17( opcode )
#define SR_r14_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r14_r18( opcode )
#define SR_r14_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r14_r19( opcode )
#define SR_r14_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r14_r20( opcode )
#define SR_r14_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r14_r21( opcode )
#define SR_r14_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r14_r22( opcode )
#define SR_r14_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r14_r23( opcode )
#define SR_r14_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r14_r24( opcode )
#define SR_r14_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r14_r25( opcode )
#define SR_r14_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r14_r26( opcode )
#define SR_r14_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r14_r27( opcode )
#define SR_r14_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r14_r28( opcode )
#define SR_r14_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r14_r29( opcode )
#define SR_r14_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r14_r30( opcode )



#define SAVE_r15 SR_r15( stw )
#define SAVE_r15_r16 SR_r15_r16( stw )
#define SAVE_r15_r17 SR_r15_r17( stw )
#define SAVE_r15_r18 SR_r15_r18( stw )
#define SAVE_r15_r19 SR_r15_r19( stw )
#define SAVE_r15_r20 SR_r15_r20( stw )
#define SAVE_r15_r21 SR_r15_r21( stw )
#define SAVE_r15_r22 SR_r15_r22( stw )
#define SAVE_r15_r23 SR_r15_r23( stw )
#define SAVE_r15_r24 SR_r15_r24( stw )
#define SAVE_r15_r25 SR_r15_r25( stw )
#define SAVE_r15_r26 SR_r15_r26( stw )
#define SAVE_r15_r27 SR_r15_r27( stw )
#define SAVE_r15_r28 SR_r15_r28( stw )
#define SAVE_r15_r29 SR_r15_r29( stw )
#define SAVE_r15_r30 SR_r15_r30( stw )
#define SAVE_r15_r31 SR_r15_r31( stw )

#define REST_r15 SR_r15( lwz )
#define REST_r15_r16 SR_r15_r16( lwz )
#define REST_r15_r17 SR_r15_r17( lwz )
#define REST_r15_r18 SR_r15_r18( lwz )
#define REST_r15_r19 SR_r15_r19( lwz )
#define REST_r15_r20 SR_r15_r20( lwz )
#define REST_r15_r21 SR_r15_r21( lwz )
#define REST_r15_r22 SR_r15_r22( lwz )
#define REST_r15_r23 SR_r15_r23( lwz )
#define REST_r15_r24 SR_r15_r24( lwz )
#define REST_r15_r25 SR_r15_r25( lwz )
#define REST_r15_r26 SR_r15_r26( lwz )
#define REST_r15_r27 SR_r15_r27( lwz )
#define REST_r15_r28 SR_r15_r28( lwz )
#define REST_r15_r29 SR_r15_r29( lwz )
#define REST_r15_r30 SR_r15_r30( lwz )
#define REST_r15_r31 SR_r15_r31( lwz )
/*
 * macros common to both GPR save and restore
 */
#define SR_r15( opcode ) \
    opcode r15, (GPR_SAVE_OFF + 16*4)(sp);
#define SR_r15_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp); \
    SR_r15( opcode )
#define SR_r15_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r15_r16( opcode )
#define SR_r15_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r15_r17( opcode )
#define SR_r15_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r15_r18( opcode )
#define SR_r15_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r15_r19( opcode )
#define SR_r15_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r15_r20( opcode )
#define SR_r15_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r15_r21( opcode )
#define SR_r15_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r15_r22( opcode )
#define SR_r15_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r15_r23( opcode )
#define SR_r15_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r15_r24( opcode )
#define SR_r15_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r15_r25( opcode )
#define SR_r15_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r15_r26( opcode )
#define SR_r15_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r15_r27( opcode )
#define SR_r15_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r15_r28( opcode )
#define SR_r15_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r15_r29( opcode )
#define SR_r15_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r15_r30( opcode )

#define SAVE_r16 SR_r16( stw )
#define SAVE_r16_r17 SR_r16_r17( stw )
#define SAVE_r16_r18 SR_r16_r18( stw )
#define SAVE_r16_r19 SR_r16_r19( stw )
#define SAVE_r16_r20 SR_r16_r20( stw )
#define SAVE_r16_r21 SR_r16_r21( stw )
#define SAVE_r16_r22 SR_r16_r22( stw )
#define SAVE_r16_r23 SR_r16_r23( stw )
#define SAVE_r16_r24 SR_r16_r24( stw )
#define SAVE_r16_r25 SR_r16_r25( stw )
#define SAVE_r16_r26 SR_r16_r26( stw )
#define SAVE_r16_r27 SR_r16_r27( stw )
#define SAVE_r16_r28 SR_r16_r28( stw )
#define SAVE_r16_r29 SR_r16_r29( stw )
#define SAVE_r16_r30 SR_r16_r30( stw )
#define SAVE_r16_r31 SR_r16_r31( stw )

#define REST_r16 SR_r16( lwz )
#define REST_r16_r17 SR_r16_r17( lwz )
#define REST_r16_r18 SR_r16_r18( lwz )
#define REST_r16_r19 SR_r16_r19( lwz )
#define REST_r16_r20 SR_r16_r20( lwz )
#define REST_r16_r21 SR_r16_r21( lwz )
#define REST_r16_r22 SR_r16_r22( lwz )
#define REST_r16_r23 SR_r16_r23( lwz )
#define REST_r16_r24 SR_r16_r24( lwz )
#define REST_r16_r25 SR_r16_r25( lwz )
#define REST_r16_r26 SR_r16_r26( lwz )
#define REST_r16_r27 SR_r16_r27( lwz )
#define REST_r16_r28 SR_r16_r28( lwz )
#define REST_r16_r29 SR_r16_r29( lwz )
#define REST_r16_r30 SR_r16_r30( lwz )
#define REST_r16_r31 SR_r16_r31( lwz )

/*
 * macros common to both GPR save and restore
 */
#define SR_r16( opcode ) \
    opcode r16, (GPR_SAVE_OFF + 15*4)(sp);
#define SR_r16_r17( opcode ) \
    opcode r17, (GPR_SAVE_OFF + 14*4)(sp); \
    SR_r16( opcode )
#define SR_r16_r18( opcode ) \
    opcode r18, (GPR_SAVE_OFF + 13*4)(sp); \
    SR_r16_r17( opcode )
#define SR_r16_r19( opcode ) \
    opcode r19, (GPR_SAVE_OFF + 12*4)(sp); \
    SR_r16_r18( opcode )
#define SR_r16_r20( opcode ) \
    opcode r20, (GPR_SAVE_OFF + 11*4)(sp); \
    SR_r16_r19( opcode )
#define SR_r16_r21( opcode ) \
    opcode r21, (GPR_SAVE_OFF + 10*4)(sp); \
    SR_r16_r20( opcode )
#define SR_r16_r22( opcode ) \
    opcode r22, (GPR_SAVE_OFF + 9*4)(sp); \
    SR_r16_r21( opcode )
#define SR_r16_r23( opcode ) \
    opcode r23, (GPR_SAVE_OFF + 8*4)(sp); \
    SR_r16_r22( opcode )
#define SR_r16_r24( opcode ) \
    opcode r24, (GPR_SAVE_OFF + 7*4)(sp); \
    SR_r16_r23( opcode )
#define SR_r16_r25( opcode ) \
    opcode r25, (GPR_SAVE_OFF + 6*4)(sp); \
    SR_r16_r24( opcode )
#define SR_r16_r26( opcode ) \
    opcode r26, (GPR_SAVE_OFF + 5*4)(sp); \
    SR_r16_r25( opcode )
#define SR_r16_r27( opcode ) \
    opcode r27, (GPR_SAVE_OFF + 4*4)(sp); \
    SR_r16_r26( opcode )
#define SR_r16_r28( opcode ) \
    opcode r28, (GPR_SAVE_OFF + 3*4)(sp); \
    SR_r16_r27( opcode )
#define SR_r16_r29( opcode ) \
    opcode r29, (GPR_SAVE_OFF + 2*4)(sp); \
    SR_r16_r28( opcode )
#define SR_r16_r30( opcode ) \
    opcode r30, (GPR_SAVE_OFF + 1*4)(sp); \
    SR_r16_r29( opcode )
#define SR_r16_r31( opcode ) \
    opcode r31, (GPR_SAVE_OFF)(sp); \
    SR_r16_r30( opcode )
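
The chained SR_r16_rNN macros above can be hard to read in expanded form. The following is a minimal C sketch of the same chaining idea, not the listing's own code: `record`, `trace`, `expand_r16_r18` and the value `GPR_SAVE_OFF = 0` are hypothetical stand-ins chosen for illustration, and the `opcode` argument is modelled as a function that logs which register is paired with which stack offset.

```c
#include <assert.h>
#include <stddef.h>

#define GPR_SAVE_OFF 0  /* assumed value; the real offset is defined elsewhere */

struct slot { int reg; int offset; };

static struct slot trace[16];
static size_t trace_len = 0;

/* Stand-in for an stw/lwz opcode: logs the register and its stack offset. */
static void record(int reg, int offset) {
    assert(trace_len < 16);
    trace[trace_len].reg = reg;
    trace[trace_len].offset = offset;
    trace_len++;
}

/* Each macro handles its highest register, then defers to the next one down. */
#define SR_R16(op)      op(16, GPR_SAVE_OFF + 15*4);
#define SR_R16_R17(op)  op(17, GPR_SAVE_OFF + 14*4); SR_R16(op)
#define SR_R16_R18(op)  op(18, GPR_SAVE_OFF + 13*4); SR_R16_R17(op)

/* Expands the r16..r18 chain and returns how many slots were touched. */
static size_t expand_r16_r18(void) {
    trace_len = 0;
    SR_R16_R18(record)
    return trace_len;
}
```

Calling `expand_r16_r18()` fills `trace` with r18, r17, r16 in that order, which mirrors how a single SAVE or REST macro in the listing unrolls into one store or load per register.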

#if defined( BUILD_MAX )

/*
 * macros for saving and restoring non-volatile
 * vector registers (VRs)
 * (uses r0 as scratch register)
 */
#define SAVE_v20        SR_v20( stvx )
#define SAVE_v20_v21    SR_v20_v21( stvx )
#define SAVE_v20_v22    SR_v20_v22( stvx )
#define SAVE_v20_v23    SR_v20_v23( stvx )
#define SAVE_v20_v24    SR_v20_v24( stvx )
#define SAVE_v20_v25    SR_v20_v25( stvx )
#define SAVE_v20_v26    SR_v20_v26( stvx )
#define SAVE_v20_v27    SR_v20_v27( stvx )
#define SAVE_v20_v28    SR_v20_v28( stvx )
#define SAVE_v20_v29    SR_v20_v29( stvx )
#define SAVE_v20_v30    SR_v20_v30( stvx )
#define SAVE_v20_v31    SR_v20_v31( stvx )

#define REST_v20        SR_v20( lvx )
#define REST_v20_v21    SR_v20_v21( lvx )
#define REST_v20_v22    SR_v20_v22( lvx )
#define REST_v20_v23    SR_v20_v23( lvx )
#define REST_v20_v24    SR_v20_v24( lvx )
#define REST_v20_v25    SR_v20_v25( lvx )
#define REST_v20_v26    SR_v20_v26( lvx )
#define REST_v20_v27    SR_v20_v27( lvx )
#define REST_v20_v28    SR_v20_v28( lvx )
#define REST_v20_v29    SR_v20_v29( lvx )
#define REST_v20_v30    SR_v20_v30( lvx )
#define REST_v20_v31    SR_v20_v31( lvx )



/*
 * macros common to both VR save and restore
 * (uses r0 as scratch register)
 */
#define SR_v20( opcode ) \
    li r0, (VR_SAVE_OFF + 11*16); \
    opcode v20, sp, r0;
#define SR_v20_v21( opcode ) \
    li r0, (VR_SAVE_OFF + 10*16); \
    opcode v21, sp, r0; \
    SR_v20( opcode )
#define SR_v20_v22( opcode ) \
    li r0, (VR_SAVE_OFF + 9*16); \
    opcode v22, sp, r0; \
    SR_v20_v21( opcode )
#define SR_v20_v23( opcode ) \
    li r0, (VR_SAVE_OFF + 8*16); \
    opcode v23, sp, r0; \
    SR_v20_v22( opcode )
#define SR_v20_v24( opcode ) \
    li r0, (VR_SAVE_OFF + 7*16); \
    opcode v24, sp, r0; \
    SR_v20_v23( opcode )
#define SR_v20_v25( opcode ) \
    li r0, (VR_SAVE_OFF + 6*16); \
    opcode v25, sp, r0; \
    SR_v20_v24( opcode )
#define SR_v20_v26( opcode ) \
    li r0, (VR_SAVE_OFF + 5*16); \
    opcode v26, sp, r0; \
    SR_v20_v25( opcode )
#define SR_v20_v27( opcode ) \
    li r0, (VR_SAVE_OFF + 4*16); \
    opcode v27, sp, r0; \
    SR_v20_v26( opcode )
#define SR_v20_v28( opcode ) \
    li r0, (VR_SAVE_OFF + 3*16); \
    opcode v28, sp, r0; \
    SR_v20_v27( opcode )
#define SR_v20_v29( opcode ) \
    li r0, (VR_SAVE_OFF + 2*16); \
    opcode v29, sp, r0; \
    SR_v20_v28( opcode )
#define SR_v20_v30( opcode ) \
    li r0, (VR_SAVE_OFF + 1*16); \
    opcode v30, sp, r0; \
    SR_v20_v29( opcode )
#define SR_v20_v31( opcode ) \
    li r0, (VR_SAVE_OFF); \
    opcode v31, sp, r0; \
    SR_v20_v30( opcode )

/*
 * macros for saving, updating and restoring VRSAVE and saving and
 * restoring non-volatile vector registers (v0 - v31)
 * (destroys r0 and CR0 field of CR)
 */
#define NON_VOLATILE_VR_TEST( last_vreg ) \
    andi. r0, r0, ((-1 << (31 - (last_vreg))) & 0x0fff);

#define RECORD_v0_v15( last_vreg ) \
    oris r0, r0, ((-1 << (15 - (last_vreg))) & 0xffff); \
    mtspr %VRSAVE, r0;

#define RECORD_v16_v31( last_vreg ) \
    oris r0, r0, 0xffff; \
    ori r0, r0, ((-1 << (31 - (last_vreg))) & 0xffff); \
    mtspr %VRSAVE, r0;

#define USE_v0_v15( cond, last_vreg ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    RECORD_v0_v15( last_vreg )

#define USE_v16_v19( cond, last_vreg ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    RECORD_v16_v31( last_vreg )

#define FREE_v0_v19( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 8 ); \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;
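
The mask expressions in NON_VOLATILE_VR_TEST and the RECORD macros are easier to check with concrete numbers. The C sketch below evaluates the same expressions on the host; the function names are hypothetical, and it assumes the usual VRSAVE convention in which the most significant bit corresponds to v0 and the least significant bit to v31. Unsigned shifts are used in place of the listing's `-1 << n` so the C code stays well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Mask tested by a NON_VOLATILE_VR_TEST-style check: bits for v20..last_vreg. */
static uint32_t nonvolatile_mask(unsigned last_vreg) {
    return (UINT32_MAX << (31u - last_vreg)) & 0x0fffu;
}

/* VRSAVE after a RECORD_v0_v15-style update (last_vreg in 0..15). */
static uint32_t record_v0_v15(uint32_t vrsave, unsigned last_vreg) {
    uint32_t upper = (UINT32_MAX << (15u - last_vreg)) & 0xffffu;
    return vrsave | (upper << 16);          /* oris: OR into the upper halfword */
}

/* VRSAVE after a RECORD_v16_v31-style update (last_vreg in 16..31). */
static uint32_t record_v16_v31(uint32_t vrsave, unsigned last_vreg) {
    uint32_t lower = (UINT32_MAX << (31u - last_vreg)) & 0xffffu;
    return vrsave | 0xffff0000u | lower;    /* oris 0xffff, then ori */
}
```

For example, `nonvolatile_mask(23)` selects exactly the four bits for v20 through v23, which is the set of callee-saved registers a routine using registers through v23 would need to preserve if a caller has marked them live.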

/*
 * user-callable macros
 */
#define USE_THRU_v0( cond )     USE_v0_v15( cond, 0 )
#define USE_THRU_v1( cond )     USE_v0_v15( cond, 1 )
#define USE_THRU_v2( cond )     USE_v0_v15( cond, 2 )
#define USE_THRU_v3( cond )     USE_v0_v15( cond, 3 )
#define USE_THRU_v4( cond )     USE_v0_v15( cond, 4 )
#define USE_THRU_v5( cond )     USE_v0_v15( cond, 5 )
#define USE_THRU_v6( cond )     USE_v0_v15( cond, 6 )
#define USE_THRU_v7( cond )     USE_v0_v15( cond, 7 )
#define USE_THRU_v8( cond )     USE_v0_v15( cond, 8 )
#define USE_THRU_v9( cond )     USE_v0_v15( cond, 9 )
#define USE_THRU_v10( cond )    USE_v0_v15( cond, 10 )
#define USE_THRU_v11( cond )    USE_v0_v15( cond, 11 )
#define USE_THRU_v12( cond )    USE_v0_v15( cond, 12 )
#define USE_THRU_v13( cond )    USE_v0_v15( cond, 13 )
#define USE_THRU_v14( cond )    USE_v0_v15( cond, 14 )
#define USE_THRU_v15( cond )    USE_v0_v15( cond, 15 )
#define USE_THRU_v16( cond )    USE_v16_v19( cond, 16 )
#define USE_THRU_v17( cond )    USE_v16_v19( cond, 17 )
#define USE_THRU_v18( cond )    USE_v16_v19( cond, 18 )
#define USE_THRU_v19( cond )    USE_v16_v19( cond, 19 )

#define USE_THRU_v20( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 32 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 20 )      /* v20 in use? */ \
    beq PC_OFFSET( 16 );            /* no, cond is set to greater than */ \
    SAVE_v20                        /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 20 )            /* indicate v0 - v20 in use */

#define USE_THRU_v21( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 40 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 21 )      /* v20 - v21 in use? */ \
    beq PC_OFFSET( 24 );            /* no, cond is set to greater than */ \
    SAVE_v20_v21                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 21 )            /* indicate v0 - v21 in use */

#define USE_THRU_v22( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 48 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 22 )      /* v20 - v22 in use? */ \
    beq PC_OFFSET( 32 );            /* no, cond is set to greater than */ \
    SAVE_v20_v22                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 22 )            /* indicate v0 - v22 in use */

#define USE_THRU_v23( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 56 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 23 )      /* v20 - v23 in use? */ \
    beq PC_OFFSET( 40 );            /* no, cond is set to greater than */ \
    SAVE_v20_v23                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 23 )            /* indicate v0 - v23 in use */

#define USE_THRU_v24( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 64 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 24 )      /* v20 - v24 in use? */ \
    beq PC_OFFSET( 48 );            /* no, cond is set to greater than */ \
    SAVE_v20_v24                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 24 )            /* indicate v0 - v24 in use */

#define USE_THRU_v25( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 72 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 25 )      /* v20 - v25 in use? */ \
    beq PC_OFFSET( 56 );            /* no, cond is set to greater than */ \
    SAVE_v20_v25                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 25 )            /* indicate v0 - v25 in use */

#define USE_THRU_v26( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 80 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 26 )      /* v20 - v26 in use? */ \
    beq PC_OFFSET( 64 );            /* no, cond is set to greater than */ \
    SAVE_v20_v26                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 26 )            /* indicate v0 - v26 in use */

#define USE_THRU_v27( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 88 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 27 )      /* v20 - v27 in use? */ \
    beq PC_OFFSET( 72 );            /* no, cond is set to greater than */ \
    SAVE_v20_v27                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 27 )            /* indicate v0 - v27 in use */

#define USE_THRU_v28( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 96 );    /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 28 )      /* v20 - v28 in use? */ \
    beq PC_OFFSET( 80 );            /* no, cond is set to greater than */ \
    SAVE_v20_v28                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 28 )            /* indicate v0 - v28 in use */

#define USE_THRU_v29( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 104 );   /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 29 )      /* v20 - v29 in use? */ \
    beq PC_OFFSET( 88 );            /* no, cond is set to greater than */ \
    SAVE_v20_v29                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 29 )            /* indicate v0 - v29 in use */

#define USE_THRU_v30( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 112 );   /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 30 )      /* v20 - v30 in use? */ \
    beq PC_OFFSET( 96 );            /* no, cond is set to greater than */ \
    SAVE_v20_v30                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 30 )            /* indicate v0 - v30 in use */

#define USE_THRU_v31( cond ) \
    mfspr r0, %VRSAVE; \
    cmplwi (cond), r0, 0; \
    beq (cond), PC_OFFSET( 120 );   /* cond set to equal if VRSAVE = 0 */ \
    stw r0, VRSAVE_SAVE_OFF(sp); \
    NON_VOLATILE_VR_TEST( 31 )      /* v20 - v31 in use? */ \
    beq PC_OFFSET( 104 );           /* no, cond is set to greater than */ \
    SAVE_v20_v31                    /* leaves a negative value in r0 */ \
    cmpwi (cond), r0, 0x7fff;       /* cond is set to less than */ \
    mfspr r0, %VRSAVE;              /* reload VRSAVE into r0 */ \
    RECORD_v16_v31( 31 )            /* indicate v0 - v31 in use */

#define FREE_THRU_v0( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v1( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v2( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v3( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v4( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v5( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v6( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v7( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v8( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v9( cond )    FREE_v0_v19( cond )
#define FREE_THRU_v10( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v11( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v12( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v13( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v14( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v15( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v16( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v17( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v18( cond )   FREE_v0_v19( cond )
#define FREE_THRU_v19( cond )   FREE_v0_v19( cond )

#define FREE_THRU_v20( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 20 ); \
    bgt (cond), PC_OFFSET( 12 ); \
    REST_v20; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v21( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 28 ); \
    bgt (cond), PC_OFFSET( 20 ); \
    REST_v20_v21; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v22( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 36 ); \
    bgt (cond), PC_OFFSET( 28 ); \
    REST_v20_v22; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v23( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 44 ); \
    bgt (cond), PC_OFFSET( 36 ); \
    REST_v20_v23; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v24( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 52 ); \
    bgt (cond), PC_OFFSET( 44 ); \
    REST_v20_v24; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v25( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 60 ); \
    bgt (cond), PC_OFFSET( 52 ); \
    REST_v20_v25; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v26( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 68 ); \
    bgt (cond), PC_OFFSET( 60 ); \
    REST_v20_v26; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v27( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 76 ); \
    bgt (cond), PC_OFFSET( 68 ); \
    REST_v20_v27; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v28( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 84 ); \
    bgt (cond), PC_OFFSET( 76 ); \
    REST_v20_v28; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v29( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 92 ); \
    bgt (cond), PC_OFFSET( 84 ); \
    REST_v20_v29; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v30( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 100 ); \
    bgt (cond), PC_OFFSET( 92 ); \
    REST_v20_v30; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;

#define FREE_THRU_v31( cond ) \
    li r0, 0; \
    beq (cond), PC_OFFSET( 108 ); \
    bgt (cond), PC_OFFSET( 100 ); \
    REST_v20_v31; \
    lwz r0, VRSAVE_SAVE_OFF(sp); \
    mtspr %VRSAVE, r0;
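
The PC_OFFSET arguments in the USE_THRU_vN and FREE_THRU_vN macros grow by 8 for each additional vector register. The C sketch below shows one way to derive them, under two assumptions that are not stated in the listing: every instruction is 4 bytes, and PC_OFFSET( n ) is measured from the address of the branch instruction itself. The function names are hypothetical.

```c
#include <assert.h>

enum { INSN_BYTES = 4, FIRST_NONVOLATILE_VR = 20 };

/* Each saved or restored vector register costs two instructions
 * (li r0, offset followed by stvx or lvx). */
static int vr_block_bytes(int last_vreg) {
    return (last_vreg - FIRST_NONVOLATILE_VR + 1) * 2 * INSN_BYTES;
}

/* FREE_THRU_vN: beq skips itself, bgt, the restore block and lwz. */
static int free_beq_offset(int last_vreg) {
    return 3 * INSN_BYTES + vr_block_bytes(last_vreg);
}

/* FREE_THRU_vN: bgt skips itself and the restore block, landing on lwz. */
static int free_bgt_offset(int last_vreg) {
    return INSN_BYTES + vr_block_bytes(last_vreg);
}

/* USE_THRU_vN: first beq skips itself, stw, andi., the second beq,
 * the save block, cmpwi and mfspr, landing on the RECORD sequence. */
static int use_outer_beq_offset(int last_vreg) {
    return 6 * INSN_BYTES + vr_block_bytes(last_vreg);
}

/* USE_THRU_vN: second beq skips itself, the save block and cmpwi. */
static int use_inner_beq_offset(int last_vreg) {
    return 2 * INSN_BYTES + vr_block_bytes(last_vreg);
}
```

Under these assumptions the formulas give 20 and 12 for FREE_THRU_v20 and 120 and 104 for USE_THRU_v31, so a helper of this kind could serve as a consistency check on hand-maintained offsets.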

#endif                                  /* end BUILD_MAX */

/*
 * macros to save and restore the CR register
 * (uses r0 as scratch register)
 */
#define SAVE_CR \
    mfcr r0; \
    stw r0, CR_SAVE_OFF(sp);

#define REST_CR \
    lwz r0, CR_SAVE_OFF(sp); \
    mtcr r0;

/*
 * macros to save and restore the LR register
 * (uses r0 as scratch register)
 */
#define SAVE_LR \
    mflr r0; \
    stw r0, LR_SAVE_OFF(sp);

#define REST_LR \
    lwz r0, LR_SAVE_OFF(sp); \
    mtlr r0;

#endif                                  /* end COMPILE_C */

/*
 * macros for declaring GPR, FPR and VMX registers
 */

/*
 * declare r0
 */
#define DECLARE_r0
/*
 * r3 declare set
 */
#define DECLARE_r3
#define DECLARE_r3_r4
#define DECLARE_r3_r5
#define DECLARE_r3_r6
#define DECLARE_r3_r7
#define DECLARE_r3_r8
#define DECLARE_r3_r9
#define DECLARE_r3_r10
#define DECLARE_r3_r11
#define DECLARE_r3_r12
#define DECLARE_r3_r13
#define DECLARE_r3_r14
#define DECLARE_r3_r15
#define DECLARE_r3_r16
#define DECLARE_r3_r17
#define DECLARE_r3_r18
#define DECLARE_r3_r19
#define DECLARE_r3_r20
#define DECLARE_r3_r21
#define DECLARE_r3_r22
#define DECLARE_r3_r23
#define DECLARE_r3_r24
#define DECLARE_r3_r25
#define DECLARE_r3_r26
#define DECLARE_r3_r27
#define DECLARE_r3_r28
#define DECLARE_r3_r29
#define DECLARE_r3_r30
#define DECLARE_r3_r31

/*
 * r4 declare set
 */
#define DECLARE_r4
#define DECLARE_r4_r5
#define DECLARE_r4_r6
#define DECLARE_r4_r7
#define DECLARE_r4_r8
#define DECLARE_r4_r9
#define DECLARE_r4_r10
#define DECLARE_r4_r11
#define DECLARE_r4_r12
#define DECLARE_r4_r13
#define DECLARE_r4_r14
#define DECLARE_r4_r15
#define DECLARE_r4_r16
#define DECLARE_r4_r17
#define DECLARE_r4_r18
#define DECLARE_r4_r19
#define DECLARE_r4_r20
#define DECLARE_r4_r21
#define DECLARE_r4_r22
#define DECLARE_r4_r23
#define DECLARE_r4_r24
#define DECLARE_r4_r25
#define DECLARE_r4_r26
#define DECLARE_r4_r27
#define DECLARE_r4_r28
#define DECLARE_r4_r29
#define DECLARE_r4_r30
#define DECLARE_r4_r31

/*
 * r5 declare set
 */
#define DECLARE_r5
#define DECLARE_r5_r6
#define DECLARE_r5_r7
#define DECLARE_r5_r8
#define DECLARE_r5_r9
#define DECLARE_r5_r10
#define DECLARE_r5_r11
#define DECLARE_r5_r12
#define DECLARE_r5_r13
#define DECLARE_r5_r14
#define DECLARE_r5_r15
#define DECLARE_r5_r16
#define DECLARE_r5_r17
#define DECLARE_r5_r18
#define DECLARE_r5_r19
#define DECLARE_r5_r20
#define DECLARE_r5_r21
#define DECLARE_r5_r22
#define DECLARE_r5_r23
#define DECLARE_r5_r24
#define DECLARE_r5_r25
#define DECLARE_r5_r26
#define DECLARE_r5_r27
#define DECLARE_r5_r28
#define DECLARE_r5_r29
#define DECLARE_r5_r30
#define DECLARE_r5_r31

/*
 * r6 declare set
 */
#define DECLARE_r6
#define DECLARE_r6_r7
#define DECLARE_r6_r8
#define DECLARE_r6_r9
#define DECLARE_r6_r10
#define DECLARE_r6_r11
#define DECLARE_r6_r12
#define DECLARE_r6_r13
#define DECLARE_r6_r14
#define DECLARE_r6_r15
#define DECLARE_r6_r16
#define DECLARE_r6_r17
#define DECLARE_r6_r18
#define DECLARE_r6_r19
#define DECLARE_r6_r20
#define DECLARE_r6_r21
#define DECLARE_r6_r22
#define DECLARE_r6_r23
#define DECLARE_r6_r24
#define DECLARE_r6_r25
#define DECLARE_r6_r26
#define DECLARE_r6_r27
#define DECLARE_r6_r28
#define DECLARE_r6_r29
#define DECLARE_r6_r30
#define DECLARE_r6_r31

/*
 * r7 declare set
 */
#define DECLARE_r7
#define DECLARE_r7_r8
#define DECLARE_r7_r9
#define DECLARE_r7_r10
#define DECLARE_r7_r11
#define DECLARE_r7_r12
#define DECLARE_r7_r13
#define DECLARE_r7_r14
#define DECLARE_r7_r15
#define DECLARE_r7_r16
#define DECLARE_r7_r17
#define DECLARE_r7_r18
#define DECLARE_r7_r19
#define DECLARE_r7_r20
#define DECLARE_r7_r21
#define DECLARE_r7_r22
#define DECLARE_r7_r23
#define DECLARE_r7_r24
#define DECLARE_r7_r25
#define DECLARE_r7_r26
#define DECLARE_r7_r27
#define DECLARE_r7_r28
#define DECLARE_r7_r29
#define DECLARE_r7_r30
#define DECLARE_r7_r31

/*
 * r8 declare set
 */
#define DECLARE_r8
#define DECLARE_r8_r9
#define DECLARE_r8_r10
#define DECLARE_r8_r11
#define DECLARE_r8_r12
#define DECLARE_r8_r13
#define DECLARE_r8_r14
#define DECLARE_r8_r15
#define DECLARE_r8_r16
#define DECLARE_r8_r17
#define DECLARE_r8_r18
#define DECLARE_r8_r19
#define DECLARE_r8_r20
#define DECLARE_r8_r21
#define DECLARE_r8_r22
#define DECLARE_r8_r23
#define DECLARE_r8_r24
#define DECLARE_r8_r25
#define DECLARE_r8_r26
#define DECLARE_r8_r27
#define DECLARE_r8_r28
#define DECLARE_r8_r29
#define DECLARE_r8_r30
#define DECLARE_r8_r31

/*
 * r9 declare set
 */
#define DECLARE_r9
#define DECLARE_r9_r10
#define DECLARE_r9_r11
#define DECLARE_r9_r12
#define DECLARE_r9_r13
#define DECLARE_r9_r14
#define DECLARE_r9_r15
#define DECLARE_r9_r16
#define DECLARE_r9_r17
#define DECLARE_r9_r18
#define DECLARE_r9_r19
#define DECLARE_r9_r20
#define DECLARE_r9_r21
#define DECLARE_r9_r22
#define DECLARE_r9_r23
#define DECLARE_r9_r24
#define DECLARE_r9_r25
#define DECLARE_r9_r26
#define DECLARE_r9_r27
#define DECLARE_r9_r28
#define DECLARE_r9_r29
#define DECLARE_r9_r30
#define DECLARE_r9_r31



/*
 * r10 declare set
 */
#define DECLARE_r10
#define DECLARE_r10_r11
#define DECLARE_r10_r12
#define DECLARE_r10_r13
#define DECLARE_r10_r14
#define DECLARE_r10_r15
#define DECLARE_r10_r16
#define DECLARE_r10_r17
#define DECLARE_r10_r18
#define DECLARE_r10_r19
#define DECLARE_r10_r20
#define DECLARE_r10_r21
#define DECLARE_r10_r22
#define DECLARE_r10_r23
#define DECLARE_r10_r24
#define DECLARE_r10_r25
#define DECLARE_r10_r26
#define DECLARE_r10_r27
#define DECLARE_r10_r28
#define DECLARE_r10_r29
#define DECLARE_r10_r30
#define DECLARE_r10_r31



/*
 * r11 declare set
 */
#define DECLARE_r11
#define DECLARE_r11_r12
#define DECLARE_r11_r13
#define DECLARE_r11_r14
#define DECLARE_r11_r15
#define DECLARE_r11_r16
#define DECLARE_r11_r17
#define DECLARE_r11_r18
#define DECLARE_r11_r19
#define DECLARE_r11_r20
#define DECLARE_r11_r21
#define DECLARE_r11_r22
#define DECLARE_r11_r23
#define DECLARE_r11_r24
#define DECLARE_r11_r25
#define DECLARE_r11_r26
#define DECLARE_r11_r27
#define DECLARE_r11_r28
#define DECLARE_r11_r29
#define DECLARE_r11_r30
#define DECLARE_r11_r31

/*
 * r12 declare set
 */
#define DECLARE_r12
#define DECLARE_r12_r13
#define DECLARE_r12_r14
#define DECLARE_r12_r15
#define DECLARE_r12_r16
#define DECLARE_r12_r17
#define DECLARE_r12_r18
#define DECLARE_r12_r19
#define DECLARE_r12_r20
#define DECLARE_r12_r21
#define DECLARE_r12_r22
#define DECLARE_r12_r23
#define DECLARE_r12_r24
#define DECLARE_r12_r25
#define DECLARE_r12_r26
#define DECLARE_r12_r27
#define DECLARE_r12_r28
#define DECLARE_r12_r29
#define DECLARE_r12_r30
#define DECLARE_r12_r31

/*
 * r13 declare set
 */
#define DECLARE_r13
#define DECLARE_r13_r14
#define DECLARE_r13_r15
#define DECLARE_r13_r16
#define DECLARE_r13_r17
#define DECLARE_r13_r18
#define DECLARE_r13_r19
#define DECLARE_r13_r20
#define DECLARE_r13_r21
#define DECLARE_r13_r22
#define DECLARE_r13_r23
#define DECLARE_r13_r24
#define DECLARE_r13_r25
#define DECLARE_r13_r26
#define DECLARE_r13_r27
#define DECLARE_r13_r28
#define DECLARE_r13_r29
#define DECLARE_r13_r30
#define DECLARE_r13_r31

/*
 * r14 declare set
 */
#define DECLARE_r14
#define DECLARE_r14_r15
#define DECLARE_r14_r16
#define DECLARE_r14_r17
#define DECLARE_r14_r18
#define DECLARE_r14_r19
#define DECLARE_r14_r20
#define DECLARE_r14_r21
#define DECLARE_r14_r22
#define DECLARE_r14_r23
#define DECLARE_r14_r24
#define DECLARE_r14_r25
#define DECLARE_r14_r26
#define DECLARE_r14_r27
#define DECLARE_r14_r28
#define DECLARE_r14_r29
#define DECLARE_r14_r30
#define DECLARE_r14_r31

/*
 * r15 declare set
 */
#define DECLARE_r15
#define DECLARE_r15_r16
#define DECLARE_r15_r17
#define DECLARE_r15_r18
#define DECLARE_r15_r19
#define DECLARE_r15_r20
#define DECLARE_r15_r21
#define DECLARE_r15_r22
#define DECLARE_r15_r23
#define DECLARE_r15_r24
#define DECLARE_r15_r25
#define DECLARE_r15_r26
#define DECLARE_r15_r27
#define DECLARE_r15_r28
#define DECLARE_r15_r29
#define DECLARE_r15_r30
#define DECLARE_r15_r31

/*
 * r16 declare set
 */
#define DECLARE_r16
#define DECLARE_r16_r17
#define DECLARE_r16_r18
#define DECLARE_r16_r19
#define DECLARE_r16_r20
#define DECLARE_r16_r21
#define DECLARE_r16_r22
#define DECLARE_r16_r23
#define DECLARE_r16_r24
#define DECLARE_r16_r25
#define DECLARE_r16_r26
#define DECLARE_r16_r27
#define DECLARE_r16_r28
#define DECLARE_r16_r29
#define DECLARE_r16_r30
#define DECLARE_r16_r31

/*
 * r17 declare set
 */
#define DECLARE_r17
#define DECLARE_r17_r18
#define DECLARE_r17_r19
#define DECLARE_r17_r20
#define DECLARE_r17_r21
#define DECLARE_r17_r22
#define DECLARE_r17_r23
#define DECLARE_r17_r24
#define DECLARE_r17_r25
#define DECLARE_r17_r26
#define DECLARE_r17_r27
#define DECLARE_r17_r28
#define DECLARE_r17_r29
#define DECLARE_r17_r30
#define DECLARE_r17_r31

/*
 * r18 declare set
 */
#define DECLARE_r18
#define DECLARE_r18_r19
#define DECLARE_r18_r20
#define DECLARE_r18_r21
#define DECLARE_r18_r22
#define DECLARE_r18_r23
#define DECLARE_r18_r24
#define DECLARE_r18_r25
#define DECLARE_r18_r26
#define DECLARE_r18_r27
#define DECLARE_r18_r28
#define DECLARE_r18_r29
#define DECLARE_r18_r30
#define DECLARE_r18_r31

/*
 * r19 declare set
 */
#define DECLARE_r19
#define DECLARE_r19_r20
#define DECLARE_r19_r21
#define DECLARE_r19_r22
#define DECLARE_r19_r23
#define DECLARE_r19_r24
#define DECLARE_r19_r25
#define DECLARE_r19_r26
#define DECLARE_r19_r27
#define DECLARE_r19_r28
#define DECLARE_r19_r29
#define DECLARE_r19_r30
#define DECLARE_r19_r31



/*
 * FPR single precision declare set
 */
#define DECLARE_f0
#define DECLARE_f0_f1
#define DECLARE_f0_f2
#define DECLARE_f0_f3
#define DECLARE_f0_f4
#define DECLARE_f0_f5
#define DECLARE_f0_f6
#define DECLARE_f0_f7
#define DECLARE_f0_f8
#define DECLARE_f0_f9
#define DECLARE_f0_f10
#define DECLARE_f0_f11
#define DECLARE_f0_f12
#define DECLARE_f0_f13
#define DECLARE_f0_f14
#define DECLARE_f0_f15
#define DECLARE_f0_f16
#define DECLARE_f0_f17
#define DECLARE_f0_f18
#define DECLARE_f0_f19
#define DECLARE_f0_f20
#define DECLARE_f0_f21
#define DECLARE_f0_f22
#define DECLARE_f0_f23
#define DECLARE_f0_f24
#define DECLARE_f0_f25
#define DECLARE_f0_f26
#define DECLARE_f0_f27
#define DECLARE_f0_f28
#define DECLARE_f0_f29
#define DECLARE_f0_f30
#define DECLARE_f0_f31



/*
 * FPR double precision declare set
 */
#define DECLARE_d0
#define DECLARE_d0_d1
#define DECLARE_d0_d2
#define DECLARE_d0_d3
#define DECLARE_d0_d4
#define DECLARE_d0_d5
#define DECLARE_d0_d6
#define DECLARE_d0_d7
#define DECLARE_d0_d8
#define DECLARE_d0_d9
#define DECLARE_d0_d10
#define DECLARE_d0_d11
#define DECLARE_d0_d12
#define DECLARE_d0_d13
#define DECLARE_d0_d14
#define DECLARE_d0_d15
#define DECLARE_d0_d16
#define DECLARE_d0_d17
#define DECLARE_d0_d18
#define DECLARE_d0_d19
#define DECLARE_d0_d20
#define DECLARE_d0_d21
#define DECLARE_d0_d22
#define DECLARE_d0_d23
#define DECLARE_d0_d24
#define DECLARE_d0_d25
#define DECLARE_d0_d26
#define DECLARE_d0_d27
#define DECLARE_d0_d28
#define DECLARE_d0_d29
#define DECLARE_d0_d30
#define DECLARE_d0_d31

/*
 * VMX declare set
 */
#define DECLARE_v0
#define DECLARE_v0_v1
#define DECLARE_v0_v2
#define DECLARE_v0_v3
#define DECLARE_v0_v4
#define DECLARE_v0_v5
#define DECLARE_v0_v6
#define DECLARE_v0_v7
#define DECLARE_v0_v8
#define DECLARE_v0_v9
#define DECLARE_v0_v10
#define DECLARE_v0_v11
#define DECLARE_v0_v12
#define DECLARE_v0_v13
#define DECLARE_v0_v14
#define DECLARE_v0_v15
#define DECLARE_v0_v16
#define DECLARE_v0_v17
#define DECLARE_v0_v18
#define DECLARE_v0_v19
#define DECLARE_v0_v20
#define DECLARE_v0_v21
#define DECLARE_v0_v22
#define DECLARE_v0_v23
#define DECLARE_v0_v24
#define DECLARE_v0_v25
#define DECLARE_v0_v26
#define DECLARE_v0_v27
#define DECLARE_v0_v28
#define DECLARE_v0_v29
#define DECLARE_v0_v30
#define DECLARE_v0_v31

#endif                                  /* end SALPPC_INC */

/*
 * END OF FILE salppc.inc
 */
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MC Standard Algorithms 



PPC Macro language Version

File Name:    SVE3_8BIT.MAC

Description:  Sum the elements of 3 signed byte vectors
              each of length N.

              sve3_8bit ( char *A, char *B, char *C, long *SUM, int N )

Restrictions: A, B and C must all be 16-byte aligned.
              N must be a multiple of 16 and >= 16.

Mercury Computer Systems, Inc.
Copyright (c) 2000 All rights reserved

Revision   Date     Engineer   Reason
0.0        000605   fpl        Created

#include "salppc.inc"



Input parameters

#define A       r3
#define B       r4
#define C       r5
#define SUM     r6
#define N       r7

#define A0p     A
#define B0p     B
#define C0p     C
#define A1p     r8
#define B1p     r9
#define C1p     r10
#define index   r11

#define zero    v0
#define one     v1
#define a0      v2
#define a1      v3
#define b0      v4
#define b1      v5
#define c0      v6
#define c1      v7
#define sum0    v8
#define sum1    v9
#define sum2    v10



FUNC_PROLOG 

ENTRY_5( sve3_8bit, A, B, C, SUM, N ) 
USE_THRU_v10( VRSAVE_COND ) 

LI( index, 0 ) 
VXOR( zero, zero, zero ) 
ADDIC_C( N, N, -32 ) 
LVX( a0, A0p, index ) 
VSPLTISB( one, 1 ) 
LVX( b0, B0p, index ) 
ADDI( A1p, A0p, 16 ) 
VXOR( sum0, sum0, sum0 ) 
ADDI( B1p, B0p, 16 ) 
VXOR( sum1, sum1, sum1 ) 
ADDI( C1p, C0p, 16 ) 
VXOR( sum2, sum2, sum2 ) 
BLT( do16 ) 

LABEL( loop ) 
ADDIC_C( N, N, -32 ) 
LVX( c0, C0p, index ) 
VMSUMMBM( sum0, a0, one, sum0 ) 
LVX( a1, A1p, index ) 
VMSUMMBM( sum1, b0, one, sum1 ) 
LVX( b1, B1p, index ) 
VMSUMMBM( sum2, c0, one, sum2 ) 
LVX( c1, C1p, index ) 
ADDI( index, index, 32 ) 
VMSUMMBM( sum0, a1, one, sum0 ) 
LVX( a0, A0p, index ) 
VMSUMMBM( sum1, b1, one, sum1 ) 
LVX( b0, B0p, index ) 
VMSUMMBM( sum2, c1, one, sum2 ) 
BGE( loop ) 

CMPWI( N, -32 ) 
BEQ( combine ) 

LABEL( do16 ) 
LVX( c0, C0p, index ) 
VMSUMMBM( sum0, a0, one, sum0 ) 
VMSUMMBM( sum1, b0, one, sum1 ) 
VMSUMMBM( sum2, c0, one, sum2 ) 

LABEL( combine ) 
VADDUWM( sum0, sum0, sum1 ) 
VADDUWM( sum0, sum0, sum2 ) 
VSUMSWS( sum0, sum0, zero ) 
VSPLTW( sum0, sum0, 3 ) 
STVEWX( sum0, 0, SUM ) 

FREE_THRU_v10( VRSAVE_COND ) 
RETURN 

FUNC_EPILOG 
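
For readers who do not follow the AltiVec macros, the arithmetic of the sve3_8bit listing above can be sketched in scalar C. This is an illustrative reference only, not part of the original library: the name sve3_8bit_ref and the use of int8_t/int32_t are assumptions. The bytes are treated as signed because each input vector is the first (signed) multiplicand of VMSUMMBM, with a vector of ones as the other operand. The saturation performed by VSUMSWS in the final reduction is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical scalar reference for the sve3_8bit listing.
 * Adds every element of three signed 8-bit vectors of length n and
 * returns the single combined 32-bit total that the listing stores to SUM.
 * Alignment and multiple-of-16 restrictions do not apply to this version.
 */
int32_t sve3_8bit_ref(const int8_t *a, const int8_t *b, const int8_t *c,
                      size_t n)
{
    int32_t sum_a = 0;   /* plays the role of sum0 */
    int32_t sum_b = 0;   /* plays the role of sum1 */
    int32_t sum_c = 0;   /* plays the role of sum2 */

    for (size_t i = 0; i < n; ++i) {
        sum_a += a[i];
        sum_b += b[i];
        sum_c += c[i];
    }
    /* "combine" step: the three partial sums collapse to one word */
    return sum_a + sum_b + sum_c;
}
```

A reference of this kind is mainly useful for checking a vectorized routine against known inputs.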
Majority Voter/Sync Control logic TOP LEVEL Module: voter_sync.vhd 

Description: This Module is the top level of the 
Majority Voter and Raceway Sync Logic 

Author: Steven Imperiali 
Date:   7-05-2000 
Date:   10-25-2000  Modified cable clock and sync 

PLD handles the following functions: 

1) Raceway clock source and skew control 
2) Raceway sync generation 
3) Majority voter logic 
4) I2C reset logic 
5) Inverter for the HS LED signal 


LIBRARY IEEE; 
USE IEEE.STD_LOGIC_1164.ALL; 
USE STD.TEXTIO.ALL; 
use ieee.std_logic_arith.all; 
use ieee.std_logic_unsigned.all; 



ENTITY voter_sync IS 
PORT ( 
    clk_66_pal6       : IN    std_logic; 
    clk_33_pal1       : IN    std_logic; 
    reset_0           : IN    std_logic; 
    x_rst_brd_0       : OUT   std_logic; 
    x_rst_brd_1       : OUT   std_logic; 
    pll_rng_sel       : OUT   std_logic; 
    pll_freq_sel      : OUT   std_logic; 
    fb_sk_sel         : OUT   std_logic; 
    fb_dev_by_2_0     : OUT   std_logic; 
    main_sk_sel0      : OUT   std_logic; 
    main_sk_sel1      : OUT   std_logic; 
    jk_sk_sel0        : OUT   std_logic; 
    jk_sk_sel1        : OUT   std_logic; 
    jx1_clk_oe        : OUT   std_logic; 
    jx2_clk_oe        : OUT   std_logic; 
    sw_clk_mode2_1    : IN    std_logic; 
    mux_clk_sel0      : OUT   std_logic; 
    mux_clk_sel1      : OUT   std_logic; 

    testn             : IN    std_logic; 
    tms0              : IN    std_logic; 

    rsync_x_nd0       : OUT   std_logic; 
    rsync_x_nd1       : OUT   std_logic; 
    rsync_x_nd2       : OUT   std_logic; 
    rsync_x_nd3       : OUT   std_logic; 
    rsync_x_pxb0      : OUT   std_logic; 
    rsync_x_xbar      : OUT   std_logic; 

    nd0_resetreq_0    : IN    std_logic; 
    nd1_resetreq_0    : IN    std_logic; 
    nd2_resetreq_0    : IN    std_logic; 
    nd3_resetreq_0    : IN    std_logic; 
    pq_resetreq_0     : IN    std_logic; 
    resetvote_0       : OUT   std_logic; 

    nd0_ckstpreqnd0_0 : IN    std_logic; 
    nd0_ckstpreqnd1_0 : IN    std_logic; 
    nd0_ckstpreqnd2_0 : IN    std_logic; 
    nd0_ckstpreqnd3_0 : IN    std_logic; 
    nd0_ckstpreqpq_0  : IN    std_logic; 
    nd1_ckstpreqnd0_0 : IN    std_logic; 
    nd1_ckstpreqnd1_0 : IN    std_logic; 
    nd1_ckstpreqnd2_0 : IN    std_logic; 
    nd1_ckstpreqnd3_0 : IN    std_logic; 
    nd1_ckstpreqpq_0  : IN    std_logic; 
    nd2_ckstpreqnd0_0 : IN    std_logic; 
    nd2_ckstpreqnd1_0 : IN    std_logic; 
    nd2_ckstpreqnd2_0 : IN    std_logic; 
    nd2_ckstpreqnd3_0 : IN    std_logic; 
    nd2_ckstpreqpq_0  : IN    std_logic; 
    nd3_ckstpreqnd0_0 : IN    std_logic; 
    nd3_ckstpreqnd1_0 : IN    std_logic; 
    nd3_ckstpreqnd2_0 : IN    std_logic; 
    nd3_ckstpreqnd3_0 : IN    std_logic; 
    nd3_ckstpreqpq_0  : IN    std_logic; 
    pq_ckstpreqnd0_0  : IN    std_logic; 
    pq_ckstpreqnd1_0  : IN    std_logic; 
    pq_ckstpreqnd2_0  : IN    std_logic; 
    pq_ckstpreqnd3_0  : IN    std_logic; 
    pq_ckstpreqpq_0   : IN    std_logic; 
    pq_ckstopin_0     : OUT   std_logic; 
    nd0_ckstopin_0    : OUT   std_logic; 
    nd1_ckstopin_0    : OUT   std_logic; 
    nd2_ckstopin_0    : OUT   std_logic; 
    nd3_ckstopin_0    : OUT   std_logic; 

    i2c_rst_0         : IN    std_logic; 
    sda               : INOUT std_logic; 
    scl               : INOUT std_logic; 

    pxb0_hs_led       : IN    std_logic; 
    hs_led            : OUT   std_logic 
); 
END voter_sync; 

ARCHITECTURE TOP_LEVEL_voter_sync OF voter_sync IS 

-- ** Component Declaration 

COMPONENT m_voter PORT ( 
    clk_66_pal6 : IN  std_logic; 
    reset_0     : IN  std_logic; 
    request0_0  : IN  std_logic; 
    request1_0  : IN  std_logic; 
    request2_0  : IN  std_logic; 
    request3_0  : IN  std_logic; 
    request4_0  : IN  std_logic; 
    healthy0_1  : IN  std_logic; 
    healthy1_1  : IN  std_logic; 
    healthy2_1  : IN  std_logic; 
    healthy3_1  : IN  std_logic; 
    healthy4_1  : IN  std_logic; 
    voteout_0   : OUT std_logic); 
END COMPONENT; 

-- ** Signals to Connect All of the Components Together 

Signal healthy0_1       : std_logic; 
Signal healthy1_1       : std_logic; 
Signal healthy2_1       : std_logic; 
Signal healthy3_1       : std_logic; 
Signal healthy4_1       : std_logic; 
Signal sync_d1          : std_logic; 
Signal sync_d2          : std_logic; 
Signal sync_d3          : std_logic; 
Signal nd0_ckstop_0, nd1_ckstop_0, nd2_ckstop_0, 
       nd3_ckstop_0, pq_ckstop_0 : std_logic; 
Signal g_nd0_resetreq_0 : std_logic; 
Signal g_nd1_resetreq_0 : std_logic; 
Signal g_nd2_resetreq_0 : std_logic; 
Signal g_nd3_resetreq_0 : std_logic; 

BEGIN 

-- ** Begin Architecture Here (Instantiations) 



nd0_ckstop_voter : m_voter PORT MAP ( 
    clk_66_pal6, 
    reset_0, 
    nd0_ckstpreqnd0_0, 
    nd1_ckstpreqnd0_0, 
    nd2_ckstpreqnd0_0, 
    nd3_ckstpreqnd0_0, 
    pq_ckstpreqnd0_0, 
    healthy0_1, 
    healthy1_1, 
    healthy2_1, 
    healthy3_1, 
    healthy4_1, 
    nd0_ckstop_0); 

nd1_ckstop_voter : m_voter PORT MAP ( 
    clk_66_pal6, 
    reset_0, 
    nd0_ckstpreqnd1_0, 
    nd1_ckstpreqnd1_0, 
    nd2_ckstpreqnd1_0, 
    nd3_ckstpreqnd1_0, 
    pq_ckstpreqnd1_0, 
    healthy0_1, 
    healthy1_1, 
    healthy2_1, 
    healthy3_1, 
    healthy4_1, 
    nd1_ckstop_0); 

nd2_ckstop_voter : m_voter PORT MAP ( 
    clk_66_pal6, 
    reset_0, 
    nd0_ckstpreqnd2_0, 
    nd1_ckstpreqnd2_0, 
    nd2_ckstpreqnd2_0, 
    nd3_ckstpreqnd2_0, 
    pq_ckstpreqnd2_0, 
    healthy0_1, 
    healthy1_1, 
    healthy2_1, 
    healthy3_1, 
    healthy4_1, 
    nd2_ckstop_0); 

nd3_ckstop_voter : m_voter PORT MAP ( 
    clk_66_pal6, 
    reset_0, 
    nd0_ckstpreqnd3_0, 
    nd1_ckstpreqnd3_0, 
    nd2_ckstpreqnd3_0, 
    nd3_ckstpreqnd3_0, 
    pq_ckstpreqnd3_0, 
    healthy0_1, 
    healthy1_1, 
    healthy2_1, 
    healthy3_1, 
    healthy4_1, 
    nd3_ckstop_0); 

pq_ckstop_voter : m_voter PORT MAP ( 
    clk_66_pal6, 
    reset_0, 
    nd0_ckstpreqpq_0, 
    nd1_ckstpreqpq_0, 
    nd2_ckstpreqpq_0, 
    nd3_ckstpreqpq_0, 
    pq_ckstpreqpq_0, 
    healthy0_1, 
    healthy1_1, 
    healthy2_1, 
    healthy3_1, 
    healthy4_1, 
    pq_ckstop_0); 



-- #################################################### 
-- this section was added to force a board level reset when 
-- the 8240 has a watchdog failure. 
-- this should have been done by feeding the 8240's WDFAIL 
-- to the reset PLD instead of forcing the 8240's resetreq 
-- to drive all other resetrequests. 

g_nd0_resetreq_0 <= nd0_resetreq_0 AND pq_resetreq_0; 
g_nd1_resetreq_0 <= nd1_resetreq_0 AND pq_resetreq_0; 
g_nd2_resetreq_0 <= nd2_resetreq_0 AND pq_resetreq_0; 
g_nd3_resetreq_0 <= nd3_resetreq_0 AND pq_resetreq_0; 

-- #################################################### 



reset_req_voter : m_voter PORT MAP ( 
    clk_66_pal6, 
    reset_0, 
    g_nd0_resetreq_0, 
    g_nd1_resetreq_0, 
    g_nd2_resetreq_0, 
    g_nd3_resetreq_0, 
    pq_resetreq_0, 
    healthy0_1, 
    healthy1_1, 
    healthy2_1, 
    healthy3_1, 
    healthy4_1, 
    resetvote_0); 

healthy0_1 <= nd0_ckstop_0; 
healthy1_1 <= nd1_ckstop_0; 
healthy2_1 <= nd2_ckstop_0; 
healthy3_1 <= nd3_ckstop_0; 
healthy4_1 <= pq_ckstop_0; 

nd0_ckstopin_0 <= nd0_ckstop_0; 
nd1_ckstopin_0 <= nd1_ckstop_0; 
nd2_ckstopin_0 <= nd2_ckstop_0; 
nd3_ckstopin_0 <= nd3_ckstop_0; 
pq_ckstopin_0  <= pq_ckstop_0; 



WITH i2c_rst_0 SELECT 
    sda <= clk_33_pal1 WHEN '0', 
           'Z'         WHEN '1', 
           'Z'         WHEN OTHERS; 

WITH i2c_rst_0 SELECT 
    scl <= clk_33_pal1 WHEN '0', 
           'Z'         WHEN '1', 
           'Z'         WHEN OTHERS; 

hs_led <= NOT(pxb0_hs_led); 



-- Sync Control 

process (clk_66_pal6, reset_0) 
BEGIN 
    IF (reset_0 = '0') THEN 
        sync_d1      <= '1'; 
        sync_d2      <= '1'; 
        sync_d3      <= '1'; 
        rsync_x_nd0  <= '0'; 
        rsync_x_nd1  <= '0'; 
        rsync_x_nd2  <= '0'; 
        rsync_x_nd3  <= '0'; 
        rsync_x_pxb0 <= '0'; 
        rsync_x_xbar <= '0'; 
    ELSIF (testn = '0' AND reset_0 = '1') THEN 
        rsync_x_nd0  <= tms0; 
        rsync_x_nd1  <= tms0; 
        rsync_x_nd2  <= tms0; 
        rsync_x_nd3  <= tms0; 
        rsync_x_pxb0 <= '0'; 
        rsync_x_xbar <= '0'; 
    ELSIF rising_edge(clk_66_pal6) THEN 
        sync_d1 <= NOT(sync_d1); 
        sync_d2 <= (NOT(sync_d2) AND sync_d1) OR (sync_d2 AND NOT(sync_d1)); 
        sync_d3 <= NOT(NOT(sync_d1) AND sync_d2); 
        rsync_x_nd0  <= sync_d3; 
        rsync_x_nd1  <= sync_d3; 
        rsync_x_nd2  <= sync_d3; 
        rsync_x_nd3  <= sync_d3; 
        rsync_x_pxb0 <= sync_d3; 
        rsync_x_xbar <= sync_d3; 
    END IF; 
END process; 

x_rst_brd_0 <= reset_0; 
x_rst_brd_1 <= NOT(reset_0); 



WITH sw_clk_mode2_1 SELECT 
    mux_clk_sel0 <= '0' WHEN "00",    -- 66MHz 
                    '0' WHEN "01", 
                    '1' WHEN "10",    -- 33MHz 
                    '0' WHEN "11", 
                    '1' WHEN OTHERS; 

WITH sw_clk_mode2_1 SELECT 
    mux_clk_sel1 <= '0' WHEN "00", 
                    '1' WHEN "01", 
                    '1' WHEN "10", 
                    '0' WHEN "11", 
                    '1' WHEN OTHERS; 

WITH sw_clk_mode2_1 SELECT 
    fb_dev_by_2_0 <= '0' WHEN "00", 
                     'Z' WHEN "01", 
                     'Z' WHEN "10", 
                     '0' WHEN "11", 
                     '1' WHEN OTHERS; 

WITH sw_clk_mode2_1 SELECT 
    pll_freq_sel <= '1' WHEN "00", 
                    '1' WHEN "01", 
                    '1' WHEN "10", 
                    '1' WHEN "11", 
                    '1' WHEN OTHERS; 




SELECT 










1 1 ' 




WHEN 


"00", 






•1' 


WHEN 


"01", 




'1' 




WHEN 


"10", 






'1' 


WHEN 


"11", 




■1' 




WHEN 


OTHERS; 




SELECT 










' 1 ' 




WHEN 


"00", 








■1' 


WHEN 


'01", 




' 1 ' 


WHEN 


"10", 








'1' 


WHEN 


'11" , 




■1 


WHEN OTHERS ; 




SELECT 










'Z' 




WHEN 


"00", 








' 0 * 


WHEN 


"01" , 




' 0 


WHEN 


"10" , 








' Z ' 


WHEN 


"11" , 



33MHz cable 1 
able 2 
66 MHz local 
-- select 0 skew for all modes 



WHEN OTHERS; 



WHEN "00", 
■ Z ' WHEN 

WHEN "10", 
• Z ' WHEN 

WHEN OTHERS ; 



WHEN "00", 
' Z ' WHEN 
WHEN "10", 
' Z ' WHEN 
WHEN OTHERS ; 



WITH sw_clk mode2 1 SELECT 
jk_sk_sel0 <= ' Z' 



WITH sw_clk mode2 1 SELECT 
jk_sk_sell <= •Z' 



WHEN "00", 
1 Z ' WHEN 

WHEN "10", 
' Z 1 WHEN 

WHEN OTHERS ; 



WHEN "00", 
' Z 1 WHEN 

WHEN "10", 
• Z ' WHEN 

WHEN OTHERS; 



WHEN "00", 
' Z ' WHEN 

WHEN "10", 
1 Z ' WHEN 

WHEN OTHERS; 



END TOP_LEVEL_voter_sync; 
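
The Sync Control process in the voter_sync listing above is a small divide-by-four state machine, which is easier to see when stepped in software. The following C model is a sketch for illustration only: the type sync_state and the functions sync_reset and sync_clock are hypothetical names, and only the three internal registers are modelled (the rsync_x_* outputs are registered copies of sync_d3 and therefore lag it by one clock).

```c
#include <stdbool.h>

/* Register state of the Sync Control process (field names follow the listing). */
typedef struct {
    bool sync_d1;
    bool sync_d2;
    bool sync_d3;
} sync_state;

/* Asynchronous reset branch (reset_0 = '0'): all three registers go to '1'. */
void sync_reset(sync_state *s)
{
    s->sync_d1 = true;
    s->sync_d2 = true;
    s->sync_d3 = true;
}

/*
 * One rising edge of clk_66_pal6. Every right-hand side uses the values
 * held before the edge, mirroring VHDL signal-assignment semantics.
 */
void sync_clock(sync_state *s)
{
    const bool d1 = s->sync_d1;
    const bool d2 = s->sync_d2;

    s->sync_d1 = !d1;                          /* toggles every clock      */
    s->sync_d2 = (!d2 && d1) || (d2 && !d1);   /* d2 XOR d1: 2-bit counter */
    s->sync_d3 = !(!d1 && d2);                 /* low in one state of four */
}
```

Stepping this model from reset gives sync_d3 = 1, 1, 1, 0 repeating, that is, one low pulse every four 66 MHz clocks.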
MC Standard Algorithms -- PPC Macro language Version 

File Name:     ZDOTPR4_VMX.K 

Description:   CPP Source code for Vector Single Precision 
               Split Complex Dot Product given that input 
               vectors are relatively unaligned. 

Entry/params:  ZDOTPR4_VMX (A, I, B, J, C, N) 

Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ] 
                           -/+ A->imagp[mI]*B->imagp[mJ]) 
               C[1] = sum (A->realp[mI]*B->imagp[mJ] 
                           +/- A->imagp[mI]*B->realp[mJ]) 
               for m=0 to N-1 

Mercury Computer Systems, Inc. 
Copyright (c) 2000 All rights reserved 

Revision   Date     Engineer   Reason 
0.0        000608   fpl        Created (from zdotpr - 
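
The formula in the header above can be written as a short scalar routine, which is convenient as a correctness reference for the vector code that follows. This is a sketch under stated assumptions: the names split_complex and zdotpr_ref, the conjugate flag, and the choice of element strides are illustrative and not taken from the library. With conjugate == 0 the upper signs of the formula apply; with conjugate != 0 the lower signs apply, which corresponds to conj(A)*B.

```c
#include <stddef.h>

/* Split-complex vector: separate real and imaginary arrays
 * (field names follow the header's A->realp / A->imagp). */
typedef struct {
    const float *realp;
    const float *imagp;
} split_complex;

/*
 * Hypothetical scalar reference for the split complex dot product.
 * stride_a and stride_b are element strides (I and J in the header).
 * c[0] receives the real part, c[1] the imaginary part.
 */
void zdotpr_ref(split_complex a, size_t stride_a,
                split_complex b, size_t stride_b,
                float c[2], size_t n, int conjugate)
{
    float rr = 0.0f;  /* sum of Ar*Br */
    float ii = 0.0f;  /* sum of Ai*Bi */
    float ri = 0.0f;  /* sum of Ar*Bi */
    float ir = 0.0f;  /* sum of Ai*Br */

    for (size_t m = 0; m < n; ++m) {
        const float ar = a.realp[m * stride_a];
        const float ai = a.imagp[m * stride_a];
        const float br = b.realp[m * stride_b];
        const float bi = b.imagp[m * stride_b];
        rr += ar * br;
        ii += ai * bi;
        ri += ar * bi;
        ir += ai * br;
    }
    c[0] = conjugate ? rr + ii : rr - ii;
    c[1] = conjugate ? ri - ir : ri + ir;
}
```

Because the vector routine accumulates in four lanes and in a different order, its result may differ from this reference in the last bits for general floating-point data.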



#include "salppc.inc" 
/** 
    ESAL CPP definitions 
**/ 

#undef FUNC_ENTRY 
#undef LOAD_A 
#undef LOAD_B 
#undef SUFFIX 

#if defined( VMX_SAL ) 

#define FUNC_ENTRY                _zdotpr4_vmx 
#define FUNC_CONJ_ENTRY           _zidotpr4_vmx 
#define LOAD_A( vT, rA, rB )      LVX( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVX( vT, rA, rB ) 
#define SUFFIX( label )           label 

#elif defined( VMX_NN ) 

#define FUNC_ENTRY                _zdotpr4_vmx_nn 
#define FUNC_CONJ_ENTRY           _zidotpr4_vmx_nn 
#define LOAD_A( vT, rA, rB )      LVXL( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVXL( vT, rA, rB ) 
#define SUFFIX( label )           label##_nn 

#elif defined( VMX_NC ) 

#define FUNC_ENTRY                _zdotpr4_vmx_nc 
#define FUNC_CONJ_ENTRY           _zidotpr4_vmx_nc 
#define LOAD_A( vT, rA, rB )      LVXL( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVX( vT, rA, rB ) 
#define SUFFIX( label )           label##_nc 

#elif defined( VMX_CN ) 

#define FUNC_ENTRY                _zdotpr4_vmx_cn 
#define FUNC_CONJ_ENTRY           _zidotpr4_vmx_cn 
#define LOAD_A( vT, rA, rB )      LVX( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVXL( vT, rA, rB ) 
#define SUFFIX( label )           label##_cn 

#elif defined( VMX_CC ) 

#define FUNC_ENTRY                _zdotpr4_vmx_cc 
#define FUNC_CONJ_ENTRY           _zidotpr4_vmx_cc 
#define LOAD_A( vT, rA, rB )      LVX( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVX( vT, rA, rB ) 
#define SUFFIX( label )           label##_cc 

#else 
#error YOU MUST DEFINE VMX_xxx, where x = C or N 
#endif 

#define VREGSAVE_COND VRSAVE_COND /* defined as 7 in salppc.inc */ 
/** 

Local CPP definitions 
**/ 

#define NMASK2 0x8 
#define NMASK1 0x4 
#define NSHIFT 4 
#define ADDRESS_INCREMENT 16 

/** 
    Input args 
**/ 

#define A      r3 
#define I      r4 
#define B      r5 
#define J      r6 
#define C      r7 
#define N      r8 
#define EFLAG  r9 

/** 
    Split complex parameters 
**/ 

#define Ar0    A 
#define Ai0    r10 
#define Br0    B 
#define Bi0    r11 
#define Cr     C 
#define Ci     r12 

/** 
    Local registers 
**/ 

#define count      r4 
#define rtmp0      r4 
#define rtmp1      r13 
#define Ar1        r13 
#define Ai1        r14 
#define Ar2        r15 
#define Ai2        r16 
#define Ar3        r17 
#define Ai3        r18 
#define Br1        r19 
#define Bi1        r20 
#define Br2        r21 
#define Bi2        r22 
#define Br3        r23 
#define Bi3        r24 
#define aoffset    r25 
#define coffset    r25 
#define boffset    r26 
#define addr_incr  r27 
/** 
    VMX registers 
**/ 

#define rsumr  v0 
#define rsumi  v1 
#define isumr  v2 
#define isumi  v3 
#define rsum0  v4 
#define rsum1  v5 
#define isum0  v6 
#define isum1  v7 

#define ar0    v4 
#define ai0    v5 
#define ar1    v6 
#define ai1    v7 
#define ar2    v8 
#define ai2    v9 
#define ar3    v10 
#define ai3    v11 
#define br0    v12 
#define bi0    v13 
#define br1    v14 
#define bi1    v15 
#define br2    v16 
#define bi2    v17 
#define br3    v18 
#define bi3    v19 
#define apC    v20 
#define atr0   v21 
#define ati0   v22 
#define atr1   v23 
#define ati1   v24 
#define atr2   v25 
#define ati2   v26 
#define atr3   v27 
#define ati3   v28 

/** 
    FPU registers 
**/ 

#define far       f0 
#define fbr       f1 
#define fai       f2 
#define fbi       f3 
#define frsumr    f4 
#define frsumi    f5 
#define fisumi    f6 
#define fisumr    f7 
#define frsum     f8 
#define fisum     f9 
#define rsum_vmx  f10 
#define isum_vmx  f11 




/** 
    Begin code text, Save some registers 
    Here for conjugate inner product 
**/ 

U_ENTRY( FUNC_CONJ_ENTRY ) 
MR( rtmp0, Cr ) 
MR( Cr, Ci ) 
MR( Ci, rtmp0 ) 
MR( rtmp0, Br0 ) 
MR( Br0, Bi0 ) 
MR( Bi0, rtmp0 ) 
/** 
    Here for normal inner product 
**/ 

FUNC_PROLOG 
U_ENTRY( FUNC_ENTRY ) 
DECLARE_f0_f11 
DECLARE_r3_r27 
DECLARE_v0_v28 

/** 
    Initial setup code 
**/ 

SAVE_r13_r27 
USE_THRU_v28( VREGSAVE_COND ) 
LFS( frsumr, Ar0, 0 ) 
FSUBS( frsumr, frsumr, frsumr ) 
FMR( frsumi, frsumr ) 
FMR( fisumr, frsumr ) 
FMR( fisumi, frsumr ) 
FMR( rsum_vmx, frsumr ) 
FMR( isum_vmx, frsumr ) 
/** 
    Process unaligned vector section first 
**/ 

LABEL( SUFFIX( cont ) ) 
GET_VMX_UNALIGNED_COUNT( count, Br0 ) 
LI( aoffset, 0 ) 
LI( boffset, 0 ) 
BEQ( SUFFIX( aligned ) ) 
SUB( N, N, count )            /* adjust N for after loop */ 
/** 
    Here to do first 1 to 3 points using standard FP 
    Store result for later post_loop processing 
**/ 

LFSX( far, Ar0, aoffset ) 
LFSX( fai, Ai0, aoffset ) 
DECR_C( count ) 
LFSX( fbr, Br0, boffset ) 
LFSX( fbi, Bi0, boffset ) 
FMULS( frsumr, far, fbr ) 
FMULS( frsumi, fai, fbi ) 
FMULS( fisumi, far, fbi ) 
FMULS( fisumr, fai, fbr ) 
ADDI( Ar0, Ar0, 4 ) 
ADDI( Ai0, Ai0, 4 ) 
ADDI( Br0, Br0, 4 ) 
ADDI( Bi0, Bi0, 4 ) 
BEQ( SUFFIX( aligned ) ) 
/** 
    Loop does 1 or 2 more sum updates 
**/ 

LABEL( SUFFIX( pre_loop ) ) 
LFSX( far, Ar0, aoffset ) 
LFSX( fai, Ai0, aoffset ) 
DECR_C( count ) 
LFSX( fbr, Br0, boffset ) 
LFSX( fbi, Bi0, boffset ) 
FMADDS( frsumr, far, fbr, frsumr ) 
ADDI( Ar0, Ar0, 4 ) 
FMADDS( frsumi, fai, fbi, frsumi ) 
ADDI( Ai0, Ai0, 4 ) 
FMADDS( fisumi, far, fbi, fisumi ) 
ADDI( Br0, Br0, 4 ) 
FMADDS( fisumr, fai, fbr, fisumr ) 
ADDI( Bi0, Bi0, 4 ) 
BNE( SUFFIX( pre_loop ) ) 

/** 
    Here for VMX aligned loop code 
    Prepare for loop entry: assign loop pointers, counters 
**/ 

LABEL( SUFFIX( aligned ) ) 
SRWI_C( count, N, 4 )         /* 16 per trip */ 
LVSL( apC, Ar0, aoffset ) 
LI( aoffset, 0 ) 
LI( boffset, 0 ) 

ADDI( Ar1, Ar0, 16 ) 
VXOR( rsumr, rsumr, rsumr ) 
ADDI( Ar2, Ar0, 32 ) 
ADDI( Ar3, Ar0, 48 ) 
ADDI( Ai1, Ai0, 16 ) 
VXOR( isumi, isumi, isumi ) 
ADDI( Ai2, Ai0, 32 ) 
ADDI( Ai3, Ai0, 48 ) 
ADDI( Br1, Br0, 16 ) 
VXOR( rsumi, rsumi, rsumi ) 
ADDI( Br2, Br0, 32 ) 
ADDI( Br3, Br0, 48 ) 
ADDI( Bi1, Bi0, 16 ) 
ADDI( Bi2, Bi0, 32 ) 
VXOR( isumr, isumr, isumr ) 
ADDI( Bi3, Bi0, 48 ) 

BEQ( SUFFIX( two_left ) ) 

/** 
    Loop windin section 
**/ 

LOAD_A( atr0, Ar0, aoffset ) 
LOAD_A( ati0, Ai0, aoffset ) 
LOAD_A( atr1, Ar1, aoffset ) 
LOAD_A( ati1, Ai1, aoffset ) 
LOAD_A( atr2, Ar2, aoffset ) 
LOAD_A( ati2, Ai2, aoffset ) 
VPERM( ar0, atr0, atr1, apC ) 
LOAD_B( br0, Br0, boffset ) 
LOAD_B( bi0, Bi0, boffset ) 
DECR_C( count ) 
VPERM( ai0, ati0, ati1, apC ) 
LOAD_B( br1, Br1, boffset ) 
VPERM( ar1, atr1, atr2, apC ) 
LOAD_A( atr3, Ar3, aoffset ) 
BR( SUFFIX( mid_loop ) ) 

/** 
    Top of vector loop 
**/ 

LABEL( SUFFIX( loop ) ) 
/* { */ 
LOAD_A( atr2, Ar2, aoffset ) 
VMADDFP( rsumr, ar3, br3, rsumr ) 
LOAD_A( ati2, Ai2, aoffset ) 
VPERM( ar0, atr0, atr1, apC )     /* uses last pass value */ 
VMADDFP( rsumi, ai3, bi3, rsumi ) 
LOAD_B( br0, Br0, boffset ) 
LOAD_B( bi0, Bi0, boffset ) 
DECR_C( count ) 
VPERM( ai0, ati0, ati1, apC ) 
LOAD_B( br1, Br1, boffset ) 
VPERM( ar1, atr1, atr2, apC ) 
VMADDFP( isumi, ar3, bi3, isumi ) 
LOAD_A( atr3, Ar3, aoffset ) 
VMADDFP( isumr, ai3, br3, isumr ) 

/** 
    Loop entry 
**/ 

LABEL( SUFFIX( mid_loop ) ) 
VMADDFP( rsumr, ar0, br0, rsumr ) 
VPERM( ai1, ati1, ati2, apC ) 
VMADDFP( rsumi, ai0, bi0, rsumi ) 
LOAD_A( ati3, Ai3, aoffset ) 
VMADDFP( isumr, ai0, br0, isumr ) 
LOAD_B( bi1, Bi1, boffset ) 
ADDI( aoffset, aoffset, 64 ) 
VPERM( ar2, atr2, atr3, apC ) 
VMADDFP( isumi, ar0, bi0, isumi ) 
LOAD_B( br2, Br2, boffset ) 
VMADDFP( rsumr, ar1, br1, rsumr ) 
LOAD_B( bi2, Bi2, boffset ) 
VMADDFP( isumr, ai1, br1, isumr ) 

/** 
    Loop exit 
**/ 

VPERM( ai2, ati2, ati3, apC ) 
BEQ( SUFFIX( loop_exit ) ) 
LOAD_A( atr0, Ar0, aoffset ) 
VMADDFP( rsumi, ai1, bi1, rsumi ) 
LOAD_A( ati0, Ai0, aoffset ) 
VMADDFP( isumi, ar1, bi1, isumi ) 
LOAD_B( br3, Br3, boffset ) 
VPERM( ar3, atr3, atr0, apC ) 
VMADDFP( rsumr, ar2, br2, rsumr ) 
LOAD_A( atr1, Ar1, aoffset ) 
VMADDFP( rsumi, ai2, bi2, rsumi ) 
VPERM( ai3, ati3, ati0, apC ) 
VMADDFP( isumi, ar2, bi2, isumi ) 
LOAD_B( bi3, Bi3, boffset ) 
ADDI( boffset, boffset, 64 ) 
LOAD_A( ati1, Ai1, aoffset ) 
VMADDFP( isumr, ai2, br2, isumr ) 
/* } */ 
BR( SUFFIX( loop ) ) 

/** 
    windout section 
**/ 

LABEL( SUFFIX( loop_exit ) ) 
LOAD_A( atr0, Ar0, aoffset ) 
VMADDFP( rsumi, ai1, bi1, rsumi ) 
LOAD_A( ati0, Ai0, aoffset ) 
VMADDFP( isumi, ar1, bi1, isumi ) 
LOAD_B( br3, Br3, boffset ) 
VPERM( ar3, atr3, atr0, apC ) 
VMADDFP( rsumr, ar2, br2, rsumr ) 
VMADDFP( rsumi, ai2, bi2, rsumi ) 
VPERM( ai3, ati3, ati0, apC ) 
VMADDFP( isumi, ar2, bi2, isumi ) 
LOAD_B( bi3, Bi3, boffset ) 
ADDI( boffset, boffset, 64 ) 
VMADDFP( isumr, ai2, br2, isumr ) 
VMADDFP( rsumr, ar3, br3, rsumr ) 
VMADDFP( rsumi, ai3, bi3, rsumi ) 
VMADDFP( isumi, ar3, bi3, isumi ) 
VMADDFP( isumr, ai3, br3, isumr ) 

/** 
    Remaining sum updates 
**/ 

LABEL( SUFFIX( two_left ) ) 
ANDI_C( count, N, 0x8 )       /* bit 3 */ 
BEQ( SUFFIX( one_left ) ) 

LOAD_B( br0, Br0, boffset ) 
LOAD_B( bi0, Bi0, boffset ) 
LOAD_B( br1, Br1, boffset ) 
LOAD_B( bi1, Bi1, boffset ) 
ADDI( boffset, boffset, 32 ) 

LOAD_A( atr0, Ar0, aoffset ) 
LOAD_A( ati0, Ai0, aoffset ) 
LOAD_A( atr1, Ar1, aoffset ) 
LOAD_A( ati1, Ai1, aoffset ) 
LOAD_A( atr2, Ar2, aoffset ) 
LOAD_A( ati2, Ai2, aoffset ) 
ADDI( aoffset, aoffset, 32 ) 

VPERM( ar0, atr0, atr1, apC )     /* uses last pass value */ 
VPERM( ai0, ati0, ati1, apC ) 
VPERM( ar1, atr1, atr2, apC ) 
VPERM( ai1, ati1, ati2, apC ) 

VMADDFP( rsumr, ar0, br0, rsumr ) 
VMADDFP( rsumi, ai0, bi0, rsumi ) 
VMADDFP( isumr, ai0, br0, isumr ) 
VMADDFP( isumi, ar0, bi0, isumi ) 
VMADDFP( rsumr, ar1, br1, rsumr ) 
VMADDFP( isumr, ai1, br1, isumr ) 
VMADDFP( rsumi, ai1, bi1, rsumi ) 
VMADDFP( isumi, ar1, bi1, isumi ) 
VMR( atr3, atr1 ) 
VMR( ati3, ati1 ) 

LABEL( SUFFIX( one_left ) ) 
ANDI_C( count, N, 0x4 )       /* bit 2 */ 
BEQ( SUFFIX( combine ) ) 

LOAD_B( br0, Br0, boffset ) 
LOAD_B( bi0, Bi0, boffset ) 
ADDI( boffset, boffset, 16 ) 

LOAD_A( atr0, Ar0, aoffset ) 
LOAD_A( ati0, Ai0, aoffset ) 
LOAD_A( atr1, Ar1, aoffset ) 
LOAD_A( ati1, Ai1, aoffset ) 
ADDI( aoffset, aoffset, 16 ) 

VPERM( ar0, atr0, atr1, apC )     /* uses last pass value */ 
VPERM( ai0, ati0, ati1, apC ) 

VMADDFP( rsumr, ar0, br0, rsumr ) 
VMADDFP( rsumi, ai0, bi0, rsumi ) 
VMADDFP( isumr, ai0, br0, isumr ) 
VMADDFP( isumi, ar0, bi0, isumi ) 

/** 
    combine partial sums, permute, write out results 
**/ 

LABEL( SUFFIX( combine ) ) 
VSUBFP( rsumr, rsumr, rsumi )     /* rsumr = rsumr - rsumi */ 
VADDFP( isumi, isumi, isumr ) 
/** 
    8 bytes/cycle shuffle: 
    real/imag logic should be intermixed for efficiency 
**/ 

VMRGHW( rsum0, rsumr, rsumr ) 
ANDI_C( addr_incr, N, 0x3 ) 
VMRGHW( isum0, isumi, isumi ) 
VMRGLW( rsum1, rsumr, rsumr ) 
SUB( addr_incr, N, addr_incr )    /* offset index for ... */ 
VMRGLW( isum1, isumi, isumi ) 
VADDFP( rsum0, rsum1, rsum0 ) 
SLWI( addr_incr, addr_incr, 2 )   /* byte offset */ 
VADDFP( isum0, isum1, isum0 ) 
VMRGHW( rsum1, rsum0, rsum0 ) 
ADD( Ar0, Ar0, addr_incr ) 
VMRGHW( isum1, isum0, isum0 ) 
ADD( Ai0, Ai0, addr_incr ) 
VMRGLW( rsum0, rsum0, rsum0 ) 
ADD( Br0, Br0, addr_incr ) 
VMRGLW( isum0, isum0, isum0 ) 
ADD( Bi0, Bi0, addr_incr ) 
VADDFP( rsumr, rsum1, rsum0 ) 
LI( coffset, 0 )                  /* needed for output */ 
VADDFP( isumi, isum1, isum0 ) 
/** 
    4 byte stores 
**/ 

STVEWX( rsumr, Cr, coffset ) 
STVEWX( isumi, Ci, coffset ) 
/** 
    Remainders of 1-3 more to do 
**/ 

ANDI_C( N, N, 3 ) 
LFS( rsum_vmx, Cr, 0 ) 
LFS( isum_vmx, Ci, 0 ) 
BEQ( SUFFIX( scaler_vmx_combine ) ) 
/** 
    Here to do last 1-3 points using standard FP 
**/ 

LABEL( SUFFIX( post_loop ) ) 
LFS( far, Ar0, 0 ) 
LFS( fai, Ai0, 0 ) 
DECR_C( N ) 
LFS( fbr, Br0, 0 ) 
LFS( fbi, Bi0, 0 ) 
FMADDS( frsumr, far, fbr, frsumr ) 
FMADDS( frsumi, fai, fbi, frsumi ) 
FMADDS( fisumi, far, fbi, fisumi ) 
FMADDS( fisumr, fai, fbr, fisumr ) 
ADDI( Ar0, Ar0, 4 ) 
ADDI( Br0, Br0, 4 ) 
ADDI( Ai0, Ai0, 4 ) 
ADDI( Bi0, Bi0, 4 ) 
BNE( SUFFIX( post_loop ) ) 

/** 
    Write out result 
**/ 

LABEL( SUFFIX( scaler_vmx_combine ) ) 
FSUBS( frsum, frsumr, frsumi ) 
FADDS( fisum, fisumi, fisumr ) 
FADDS( frsum, frsum, rsum_vmx ) 
FADDS( fisum, fisum, isum_vmx ) 
STFS( frsum, Cr, 0 ) 
STFS( fisum, Ci, 0 ) 

/** 
    return 
**/ 

LABEL( SUFFIX( ret ) ) 
FREE_THRU_v28( VREGSAVE_COND ) 
REST_r13_r27 
RETURN 
FUNC_EPILOG 
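
The control structure of the ZDOTPR4_VMX.K listing above (a scalar head of up to three points until the B operand is 16-byte aligned, a vector body, and a scalar tail of up to three points) can be sketched in portable C. This is a simplified illustration for a real-valued dot product, not the routine itself: the name dot_blocked is hypothetical, the body processes 4 floats per trip rather than the 16/8/4 staging of the listing, and the four-element lane array merely stands in for one vector register. The final reduction pairs lanes as (0+2)+(1+3), which is the order produced by the VMRGHW/VMRGLW/VADDFP sequence in the combine step.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the head / body / tail blocking used by the listing.
 * Scalar code handles elements before b reaches a 16-byte boundary and the
 * leftover elements at the end; the body accumulates into four lanes.
 */
float dot_blocked(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float scalar_sum = 0.0f;               /* head and tail (FPU path)      */
    float lanes[4] = {0.0f, 0.0f, 0.0f, 0.0f};  /* body (vector path)       */

    /* Head: at most 3 points until b + i is 16-byte aligned. */
    while (i < n && ((uintptr_t)(b + i) & 15u) != 0) {
        scalar_sum += a[i] * b[i];
        ++i;
    }

    /* Body: one 4-float group per trip, one product per lane. */
    for (; i + 4 <= n; i += 4) {
        for (size_t lane = 0; lane < 4; ++lane) {
            lanes[lane] += a[i + lane] * b[i + lane];
        }
    }

    /* Tail: remaining 1 to 3 points. */
    for (; i < n; ++i) {
        scalar_sum += a[i] * b[i];
    }

    /* Combine: lane pairing mirrors the merge-high / merge-low reduction. */
    return scalar_sum + ((lanes[0] + lanes[2]) + (lanes[1] + lanes[3]));
}
```

In the listing the A operand may remain unaligned in the body; it is realigned with LVSL and VPERM, which this sketch does not model.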
MC Standard Algorithms -- PPC Macro language Version 

File Name:     ZDOTPR4_VMX.MAC 

Description:   Vector Single Precision Complex Dot Product 
               CPP dummy file for unaligned vector processing 

Entry/params:  ZDOTPR4_VMX (A, I, B, J, C, N) 

Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ] 
                           - A->imagp[mI]*B->imagp[mJ]) 
               C[1] = sum (A->realp[mI]*B->imagp[mJ] 
                           + A->imagp[mI]*B->realp[mJ]) 
               for m=0 to N-1 

Mercury Computer Systems, Inc. 
Copyright (c) 1998 All rights reserved 

Revision   Date     Reason 
0.0        000607   Created (from zdotpr_vmx.mac) 

#if defined( BUILD_MAX ) 

#undef VMX_SAL 
#undef VMX_NN 
#undef VMX_NC 
#undef VMX_CN 
#undef VMX_CC 

#if !defined( COMPILE_ESAL_JUMP_TABLE ) 

/* 1 variant: _zdotpr4_vmx() */ 

#define VMX_SAL 
#include "zdotpr4_vmx.k" 

#else                         /* 5 variants based on ESAL flag */ 

#define VMX_NN 
#include "zdotpr4_vmx.k" 
#undef VMX_NN 

#define VMX_NC 
#include "zdotpr4_vmx.k" 
#undef VMX_NC 

#define VMX_CN 
#include "zdotpr4_vmx.k" 
#undef VMX_CN 

#define VMX_CC 
#include "zdotpr4_vmx.k" 
#undef VMX_CC 

#endif                        /* end COMPILE_ESAL_JUMP_TABLE */ 
#endif                        /* end BUILD_MAX */ 
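
The .MAC file above produces several entry points from one source body by redefining the load macros and SUFFIX before each inclusion of the .k file. The same token-pasting idea can be shown in a single C file. This is a sketch only: DEFINE_DOT, LOAD_PLAIN, LOAD_LAST and the dot_* function names are hypothetical, and both load macros read memory identically here, whereas LVX and LVXL in the listing differ in their cache hint.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the two load flavours (LVX vs LVXL). */
#define LOAD_PLAIN(p, i) ((p)[(i)])
#define LOAD_LAST(p, i)  ((p)[(i)])

/*
 * One body, stamped out once per variant. The suffix argument plays the
 * role of SUFFIX( label ) label##_nn in the listing, and LOAD_A / LOAD_B
 * are bound per variant just as the .k file binds them per inclusion.
 */
#define DEFINE_DOT(suffix, LOAD_A, LOAD_B)                          \
    float dot##suffix(const float *a, const float *b, size_t n)     \
    {                                                               \
        float sum = 0.0f;                                           \
        for (size_t i = 0; i < n; ++i) {                            \
            sum += LOAD_A(a, i) * LOAD_B(b, i);                     \
        }                                                           \
        return sum;                                                 \
    }

DEFINE_DOT(_nn, LOAD_LAST,  LOAD_LAST)
DEFINE_DOT(_nc, LOAD_LAST,  LOAD_PLAIN)
DEFINE_DOT(_cn, LOAD_PLAIN, LOAD_LAST)
DEFINE_DOT(_cc, LOAD_PLAIN, LOAD_PLAIN)
```

Repeated inclusion, as used by the listing, scales better for long bodies because the body does not have to live inside a single macro definition.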
MC Standard Algorithms -- PPC Macro language Version 

File Name:     ZDOTPR.K 

Description:   CPP Source code for Vector Single Precision 
               Split Complex Dot Product 

Entry/params:  ZDOTPR (A, I, B, J, C, N) 
               ZIDOTPR (A, I, B, J, C, N) 

Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ] 
                           -/+ A->imagp[mI]*B->imagp[mJ]) 
               C[1] = sum (A->realp[mI]*B->imagp[mJ] 
                           +/- A->imagp[mI]*B->realp[mJ]) 
               for m=0 to N-1 

Mercury Computer Systems, Inc. 
Copyright (c) 1998 All rights reserved 

Revision   Date     Engineer   Reason 
           981215   fpl        Created 
           990310   fpl        Integrated with 750 library 
0.2        000131   jfk        salppc.inc changes 
0.3        000223   fpl        Fixed pre-loop bug 
0.4        000717   fpl        Added dsts, removed LVXLs 



#include "salppc.inc" 

/** 
    ESAL CPP definitions 
**/ 

#undef FUNC_CONJ_ENTRY 
#undef FUNC_ENTRY 
#undef LOAD_A 
#undef LOAD_B 
#undef SUFFIX 

#if defined( VMX_SAL ) 

#define FUNC_ENTRY                _zdotpr_vmx 
#define FUNC_CONJ_ENTRY           _zidotpr_vmx 
#define LOAD_A( vT, rA, rB )      LVX( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVX( vT, rA, rB ) 
#define SUFFIX( label )           label 
#undef DSTA( ptr, control ) 
#undef DSTB( ptr, control ) 
#define DSTA( ptr, control ) 
#define DSTB( ptr, control ) 
#undef DST_ENABLE 

#elif defined( VMX_NN ) 

#define FUNC_ENTRY                _zdotpr_vmx_nn 
#define FUNC_CONJ_ENTRY           _zidotpr_vmx_nn 
#define LOAD_A( vT, rA, rB )      LVX( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVX( vT, rA, rB ) 
#define SUFFIX( label )           label##_nn 
#undef DSTA( ptr, control ) 
#undef DSTB( ptr, control ) 
#define DSTA( ptr, control ) 
#define DSTB( ptr, control ) 
#undef DST_ENABLE 

#elif defined( VMX_NC ) 

#define FUNC_ENTRY                _zdotpr_vmx_nc 
#define FUNC_CONJ_ENTRY           _zidotpr_vmx_nc 
#define LOAD_A( vT, rA, rB )      LVX( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVX( vT, rA, rB ) 
#define SUFFIX( label )           label##_nc 
#undef DSTA( ptr, control ) 
#undef DSTB( ptr, control ) 
#define DSTA( ptr, control )      DST( ptr, control, 0 ) \ 
                                  ADDI( ptr, ptr, 64 ) 
#define DSTB( ptr, control ) 
#define DST_ENABLE 

#elif defined( VMX_CN ) 

#define FUNC_ENTRY                _zdotpr_vmx_cn 
#define FUNC_CONJ_ENTRY           _zidotpr_vmx_cn 
#define LOAD_A( vT, rA, rB )      LVX( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVX( vT, rA, rB ) 
#define SUFFIX( label )           label##_cn 
#undef DSTA( ptr, control ) 
#undef DSTB( ptr, control ) 
#define DSTA( ptr, control ) 
#define DSTB( ptr, control )      DST( ptr, control, 0 ) \ 
                                  ADDI( ptr, ptr, 64 ) 
#define DST_ENABLE 

#elif defined( VMX_CC ) 

//#define FUNC_ENTRY              _zdotpr_vmx_cc 
#define FUNC_ENTRY                _zdotpr_vmx 
#define FUNC_CONJ_ENTRY           _zidotpr_vmx_cc 
#define LOAD_A( vT, rA, rB )      LVX( vT, rA, rB ) 
#define LOAD_B( vT, rA, rB )      LVX( vT, rA, rB ) 
#define SUFFIX( label )           label##_cc 
#undef DSTA( ptr, control ) 
#undef DSTB( ptr, control ) 
#define DSTA( ptr, control ) 
#define DSTB( ptr, control ) 
#undef DST_ENABLE 

#else 
#error YOU MUST DEFINE VMX_xxx, where x = C or N 
#endif 

#define VREGSAVE_COND VRSAVE_COND /* defined as 7 in salppc.inc */ 
/** 

Local CPP definitions 
**/ 

#define NMASK2 0x8 
#define NMASK1 0x4 
#define NSHIFT 4 
#define ADDRESS_INCREMENT 16 

/** 
    Input args 
**/ 

#define A      r3 
#define I      r4 
#define B      r5 
#define J      r6 
#define C      r7 
#define N      r8 
#define EFLAG  r9 

/** 
    Split complex parameters 
**/ 

#define Ar0    A 
#define Ai0    r10 
#define Br0    B 
#define Bi0    r11 
#define Cr     C 
#define Ci     r12 

/** 
    Local registers 
**/ 

#define count        r4 
#define rtmp0        r4 
#define rtmp1        r13 
#define dst_stride   r13 
#define num_blocks   r14 
#define Ar1          r13 
#define Ai1          r14 
#define Ar2          r15 
#define Ai2          r16 
#define Ar3          r17 
#define Ai3          r18 
#define Br1          r19 
#define Bi1          r20 
#define Br2          r21 
#define Bi2          r22 
#define Br3          r23 
#define Bi3          r24 
#define ptr_offset0  r25 
#define ptr_offset1  r26 
#define addr_incr    r27 
#define dst_rptr     r28 
#define dst_iptr     r29 
#define dst_control  r30 



/** 

VMX registers 
**/ 

#define rsumr  v0 
#define rsumi  v1 
#define isumr  v2 
#define isumi  v3 
#define rsum0  v4 
#define rsum1  v5 
#define isum0  v6 
#define isum1  v7 

#define ar0    v4 
#define ai0    v5 
#define ar1    v6 
#define ai1    v7 
#define ar2    v8 
#define ai2    v9 
#define ar3    v10 
#define ai3    v11 
#define br0    v12 
#define bi0    v13 
#define br1    v14 
#define bi1    v15 
#define br2    v16 
#define bi2    v17 
#define br3    v18 
#define bi3    v19 


/** 
    FPU registers 
**/ 

#define far       f0 
#define fbr       f1 
#define fai       f2 
#define fbi       f3 
#define frsumr    f4 
#define frsumi    f5 
#define fisumi    f6 
#define fisumr    f7 
#define frsum     f8 
#define fisum     f9 
#define rsum_vmx  f10 
#define isum_vmx  f11 


/** 







Begin code text, Save some registers 
Here for conjugate inner product 
**/ 

U_ENTRY( FUNC_CONJ_ENTRY ) 
MR( rtmp0, Cr ) 
MR( Cr, Ci ) 
MR( Ci, rtmp0 ) 
MR( rtmp0, Br0 ) 
MR( Br0, Bi0 ) 
MR( Bi0, rtmp0 ) 

/** 
    Here for normal inner product 
**/ 

U_ENTRY( FUNC_ENTRY ) 
DECLARE_f0_f11 
DECLARE_r3_r30 
DECLARE_v0_v19 

/** 
    Initial setup code 
**/ 

SAVE_r13_r30 
USE_THRU_v19( VREGSAVE_COND ) 
LFS( frsumr, Ar0, 0 ) 
FSUBS( frsumr, frsumr, frsumr ) 
FMR( frsumi, frsumr ) 
FMR( fisumr, frsumr ) 
FMR( fisumi, frsumr ) 
FMR( rsum_vmx, frsumr ) 
FMR( isum_vmx, frsumr ) 

/** 
    Process unaligned vector section first 
**/ 

LABEL( SUFFIX( cont ) ) 
GET_VMX_UNALIGNED_COUNT( count, Ar0 ) 
LI( ptr_offset0, 0 ) 
BEQ( SUFFIX( aligned ) ) 
SUB( N, N, count )            /* adjust N for after loop */ 

/** 
    Here to do first 1 to 3 points using standard FP 
    Store result for later post_loop processing 
**/ 

LFSX( far, Ar0, ptr_offset0 ) 
LFSX( fai, Ai0, ptr_offset0 ) 
DECR_C( count ) 
LFSX( fbr, Br0, ptr_offset0 ) 
LFSX( fbi, Bi0, ptr_offset0 ) 
FMULS( frsumr, far, fbr ) 
FMULS( frsumi, fai, fbi ) 
FMULS( fisumi, far, fbi ) 
FMULS( fisumr, fai, fbr ) 
ADDI( Ar0, Ar0, 4 ) 
ADDI( Ai0, Ai0, 4 ) 
ADDI( Br0, Br0, 4 ) 
ADDI( Bi0, Bi0, 4 ) 
BEQ( SUFFIX( aligned ) ) 
/** 
    Loop does 1 or 2 more sum updates 
**/ 

LABEL( SUFFIX( pre_loop ) ) 
LFSX( far, Ar0, ptr_offset0 ) 
LFSX( fai, Ai0, ptr_offset0 ) 
DECR_C( count ) 
LFSX( fbr, Br0, ptr_offset0 ) 
LFSX( fbi, Bi0, ptr_offset0 ) 
FMADDS( frsumr, far, fbr, frsumr ) 
ADDI( Ar0, Ar0, 4 ) 
FMADDS( frsumi, fai, fbi, frsumi ) 
ADDI( Ai0, Ai0, 4 ) 
FMADDS( fisumi, far, fbi, fisumi ) 
ADDI( Br0, Br0, 4 ) 
FMADDS( fisumr, fai, fbr, fisumr ) 
ADDI( Bi0, Bi0, 4 ) 
BNE( SUFFIX( pre_loop ) ) 
/**
   Here for VMX aligned loop code
   Prepare for loop entry: assign loop pointers, counters
**/

LABEL( SUFFIX( aligned ) )

/**
   DST setup: bring in 2 cachelines
   MAKE_STREAM_CODE( control_register, bytes_per_block, block_count, byte_stride )
**/

#if defined( DST_ENABLE )
#if defined( EXPAND_NCC )
MR( dst_rptr, Ar )
MR( dst_iptr, Ai )
#elif defined( EXPAND_CNC )
MR( dst_rptr, Br )
MR( dst_iptr, Bi )
#endif
MAKE_STREAM_CODE( dst_control, 64, 1, 0 )
DSTA( dst_rptr, dst_control )
DSTA( dst_iptr, dst_control )
DSTB( dst_rptr, dst_control )
DSTB( dst_iptr, dst_control )
#endif

SRWI_C( count, N, NSHIFT ) /* 16 per trip */
LI( addr_incr, ADDRESS_INCREMENT ) /* constants defined above */
SLWI( ptr_offset1, addr_incr, 2 )
NEG( ptr_offset1, ptr_offset1 ) /* will be adding addr_incr */

ADD( Ar1, Ar0, addr_incr )
VXOR( rsumr, rsumr, rsumr )
ADD( Br1, Br0, addr_incr )
ADD( Ai1, Ai0, addr_incr )
VXOR( rsumi, rsumi, rsumi )
ADD( Bi1, Bi0, addr_incr )

ADD( Ar2, Ar1, addr_incr )
VXOR( isumr, isumr, isumr )
ADD( Br2, Br1, addr_incr )
ADD( Ai2, Ai1, addr_incr )
VXOR( isumi, isumi, isumi )
ADD( Bi2, Bi1, addr_incr )

ADD( Ar3, Ar2, addr_incr )
ADD( Br3, Br2, addr_incr )
ADD( Ai3, Ai2, addr_incr )
ADD( Bi3, Bi2, addr_incr )

SLWI( addr_incr, addr_incr, 3 ) /* bump by 8 elements */

/**
   Loop entry code
**/

DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset0 )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )

/**
   Top of double loop structure
**/

LABEL( SUFFIX( loop0 ) )
LOAD_A( ar1, Ar1, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
DSTA( dst_iptr, dst_control )
LOAD_B( br1, Br1, ptr_offset0 )
VMADDFP( rsumi, ai0, bi0, rsumi )
LOAD_A( ai1, Ai1, ptr_offset0 )
LOAD_B( bi1, Bi1, ptr_offset0 )
DSTB( dst_iptr, dst_control )
DECR_C( count )
LOAD_A( ar2, Ar2, ptr_offset0 )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
LOAD_B( br2, Br2, ptr_offset0 )
VMADDFP( rsumr, ar1, br1, rsumr )
ADD( ptr_offset1, ptr_offset1, addr_incr )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ai2, Ai2, ptr_offset0 )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( bi2, Bi2, ptr_offset0 )
VMADDFP( isumr, ai1, br1, isumr )
VMADDFP( rsumr, ar2, br2, rsumr )
LOAD_A( ar3, Ar3, ptr_offset0 )
VMADDFP( rsumi, ai2, bi2, rsumi )
LOAD_B( br3, Br3, ptr_offset0 )
LOAD_A( ai3, Ai3, ptr_offset0 )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, ptr_offset0 )
VMADDFP( isumr, ai2, br2, isumr )
BEQ( SUFFIX( loop0_exit ) )
DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset1 )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset1 )
VMADDFP( isumi, ar3, bi3, isumi )
LOAD_A( ai0, Ai0, ptr_offset1 )
LOAD_B( bi0, Bi0, ptr_offset1 )
VMADDFP( isumr, ai3, br3, isumr )
BR( SUFFIX( loop1 ) )

/**
   loop exit
**/

LABEL( SUFFIX( loop0_exit ) )
MR( ptr_offset0, ptr_offset1 )
BR( SUFFIX( loop1_exit ) )

/**
   Top of second loop
**/

LABEL( SUFFIX( loop1 ) )
LOAD_A( ar1, Ar1, ptr_offset1 )
VMADDFP( rsumr, ar0, br0, rsumr )
DSTA( dst_iptr, dst_control )
LOAD_B( br1, Br1, ptr_offset1 )
VMADDFP( rsumi, ai0, bi0, rsumi )
LOAD_A( ai1, Ai1, ptr_offset1 )
LOAD_B( bi1, Bi1, ptr_offset1 )
DSTB( dst_iptr, dst_control )
DECR_C( count )
LOAD_A( ar2, Ar2, ptr_offset1 )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
LOAD_B( br2, Br2, ptr_offset1 )
VMADDFP( rsumr, ar1, br1, rsumr )
ADD( ptr_offset0, ptr_offset0, addr_incr )
VMADDFP( rsumi, ai1, bi1, rsumi )
LOAD_A( ai2, Ai2, ptr_offset1 )
VMADDFP( isumi, ar1, bi1, isumi )
LOAD_B( bi2, Bi2, ptr_offset1 )
VMADDFP( isumr, ai1, br1, isumr )
VMADDFP( rsumr, ar2, br2, rsumr )
LOAD_A( ar3, Ar3, ptr_offset1 )
VMADDFP( rsumi, ai2, bi2, rsumi )
LOAD_B( br3, Br3, ptr_offset1 )
LOAD_A( ai3, Ai3, ptr_offset1 )
VMADDFP( isumi, ar2, bi2, isumi )
LOAD_B( bi3, Bi3, ptr_offset1 )
VMADDFP( isumr, ai2, br2, isumr )
BEQ( SUFFIX( loop1_exit ) )
DSTA( dst_rptr, dst_control )
LOAD_A( ar0, Ar0, ptr_offset0 )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
DSTB( dst_rptr, dst_control )
LOAD_B( br0, Br0, ptr_offset0 )
VMADDFP( isumi, ar3, bi3, isumi )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
VMADDFP( isumr, ai3, br3, isumr )
BR( SUFFIX( loop0 ) )

/**
   Drop out of loop, flush pipe
**/

LABEL( SUFFIX( loop1_exit ) )
VMADDFP( rsumr, ar3, br3, rsumr )
VMADDFP( rsumi, ai3, bi3, rsumi )
VMADDFP( isumi, ar3, bi3, isumi )
VMADDFP( isumr, ai3, br3, isumr )

/**
   Remaining sum updates
**/

LABEL( SUFFIX( two_left ) )
ANDI_C( count, N, 0x8 ) /* bit 3 */
BEQ( SUFFIX( one_left ) )
LOAD_A( ar0, Ar0, ptr_offset0 )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
LOAD_A( ar1, Ar1, ptr_offset0 )
LOAD_B( br1, Br1, ptr_offset0 )
LOAD_A( ai1, Ai1, ptr_offset0 )
LOAD_B( bi1, Bi1, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
VMADDFP( rsumr, ar1, br1, rsumr )
VMADDFP( rsumi, ai1, bi1, rsumi )
VMADDFP( isumi, ar1, bi1, isumi )
VMADDFP( isumr, ai1, br1, isumr )
ADDI( ptr_offset0, ptr_offset0,



LABEL( SUFFIX( one_left ) )
ANDI_C( count, N, 0x4 ) /* bit 2 */
BEQ( SUFFIX( combine ) )
LOAD_A( ar0, Ar0, ptr_offset0 )
LOAD_B( br0, Br0, ptr_offset0 )
LOAD_A( ai0, Ai0, ptr_offset0 )
LOAD_B( bi0, Bi0, ptr_offset0 )
VMADDFP( rsumr, ar0, br0, rsumr )
VMADDFP( rsumi, ai0, bi0, rsumi )
VMADDFP( isumi, ar0, bi0, isumi )
VMADDFP( isumr, ai0, br0, isumr )
ADDI( ptr_offset0, ptr_offset0, 16 )

/**
   combine partial sums, permute, write out results
**/

LABEL( SUFFIX( combine ) )
VSUBFP( rsumr, rsumr, rsumi ) /* rsumr = rsumr - rsumi */
VADDFP( isumi, isumi, isumr )

/**
   8 bytes/cycle shuffle:
   real/imag logic should be intermixed for efficiency
**/

VMRGHW( rsum0, rsumr, rsumr )
ANDI_C( addr_incr, N, 0x3 )
VMRGHW( isum0, isumi, isumi )
VMRGLW( rsumi, rsumr, rsumr )
SUB( addr_incr, N, addr_incr ) /* offset index for remainders */
VMRGLW( isumi, isumi, isumi )
VADDFP( rsum0, rsumi, rsum0 )
SLWI( addr_incr, addr_incr, 2 ) /* byte offset */
VADDFP( isum0, isumi, isum0 )
VMRGHW( rsumi, rsum0, rsum0 )
ADD( Ar0, Ar0, addr_incr )
VMRGHW( isumi, isum0, isum0 )
ADD( Ai0, Ai0, addr_incr )
VMRGLW( rsum0, rsum0, rsum0 )
ADD( Br0, Br0, addr_incr )
VMRGLW( isum0, isum0, isum0 )
ADD( Bi0, Bi0, addr_incr )
VADDFP( rsumr, rsumi, rsum0 )
LI( ptr_offset0, 0 ) /* needed for output */
VADDFP( isumi, isumi, isum0 )

/**
   4 byte stores
**/

STVEWX( rsumr, Cr, ptr_offset0 )
STVEWX( isumi, Ci, ptr_offset0 )

/**
   Remainders of 1-3 more to do
**/

ANDI_C( N, N, 3 )
LFS( rsum_vmx, Cr, 0 )
LFS( isum_vmx, Ci, 0 )
BEQ( SUFFIX( scaler_vmx_combine ) )



/**
   Here to do last 1-3 points using standard FP
**/

LABEL( SUFFIX( post_loop ) )
LFS( far, Ar0, 0 )
LFS( fai, Ai0, 0 )
DECR_C( N )
LFS( fbr, Br0, 0 )
LFS( fbi, Bi0, 0 )
FMADDS( frsumr, far, fbr, frsumr )
FMADDS( frsumi, fai, fbi, frsumi )
FMADDS( fisumi, far, fbi, fisumi )
FMADDS( fisumr, fai, fbr, fisumr )
ADDI( Ar0, Ar0, 4 )
ADDI( Br0, Br0, 4 )
ADDI( Ai0, Ai0, 4 )
ADDI( Bi0, Bi0, 4 )
BNE( SUFFIX( post_loop ) )

/**
   Write out result
**/

LABEL( SUFFIX( scaler_vmx_combine ) )
FSUBS( frsum, frsumr, frsumi ) /* rsumr = rsumr - rsumi */
FADDS( fisum, fisumi, fisumr )
FADDS( frsum, frsum, rsum_vmx )
FADDS( fisum, fisum, isum_vmx )
STFS( frsum, Cr, 0 )
STFS( fisum, Ci, 0 )
/**
   return
**/

LABEL( SUFFIX( ret ) )
FREE_THRU_v19( VREGSAVE_COND )
REST_r13_r30
RETURN
FUNC_EPILOG
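
The listing above processes the vectors in three phases: a scalar head of up to three elements until the first input pointer is vector aligned, a vector body that keeps four independent partial sums per lane, and a scalar tail for the last one to three elements. The following C sketch illustrates that structure only. The function name zdot_phased is hypothetical, the vector registers are emulated with four-element arrays, and the body advances 4 elements per step rather than the 16 per trip of the unrolled loops. The lane fold (x0 + x2) + (x1 + x3) follows the order that the merge-high/merge-low sequence in the combine section appears to produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: complex dot product of split-complex arrays,
   organised as scalar head / 4-lane body / scalar tail. */
void zdot_phased(const float *ar, const float *ai,
                 const float *br, const float *bi,
                 size_t n, float *cr, float *ci)
{
    assert(ar && ai && br && bi && cr && ci);

    /* Scalar partial sums (head and tail). */
    float s_rr = 0.0f, s_ii = 0.0f, s_ri = 0.0f, s_ir = 0.0f;
    /* Emulated vector accumulators, one partial sum per lane. */
    float v_rr[4] = {0}, v_ii[4] = {0}, v_ri[4] = {0}, v_ir[4] = {0};
    size_t m = 0;

    /* Head: 0 to 3 scalar steps until ar reaches a 16-byte boundary.
       Only ar is tested here, as an assumption of this sketch. */
    size_t head = (4u - (((uintptr_t)ar >> 2) & 3u)) & 3u;
    if (head > n) head = n;
    for (; m < head; ++m) {
        s_rr += ar[m] * br[m];  s_ii += ai[m] * bi[m];
        s_ri += ar[m] * bi[m];  s_ir += ai[m] * br[m];
    }

    /* Body: four elements per step, one per emulated lane. */
    for (; m + 4 <= n; m += 4) {
        for (int l = 0; l < 4; ++l) {
            v_rr[l] += ar[m + l] * br[m + l];
            v_ii[l] += ai[m + l] * bi[m + l];
            v_ri[l] += ar[m + l] * bi[m + l];
            v_ir[l] += ai[m + l] * br[m + l];
        }
    }

    /* Combine per lane, then fold the lanes as (x0 + x2) + (x1 + x3). */
    float re[4], im[4];
    for (int l = 0; l < 4; ++l) {
        re[l] = v_rr[l] - v_ii[l];
        im[l] = v_ri[l] + v_ir[l];
    }
    float v_re = (re[0] + re[2]) + (re[1] + re[3]);
    float v_im = (im[0] + im[2]) + (im[1] + im[3]);

    /* Tail: the last 0 to 3 elements, again scalar. */
    for (; m < n; ++m) {
        s_rr += ar[m] * br[m];  s_ii += ai[m] * bi[m];
        s_ri += ar[m] * bi[m];  s_ir += ai[m] * br[m];
    }

    /* Final result: scalar sums plus folded vector sums. */
    *cr = (s_rr - s_ii) + v_re;
    *ci = (s_ri + s_ir) + v_im;
}
```

Because the partial sums are accumulated in a different order than a single sequential loop, results may differ from a naive loop in the last bits for general floating-point data.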
#define ZDOTPR 0
#define ZIDOTPR 1



MC Standard Algorithms -- PPC Macro language Version

File Name:     ZDOTPR.MAC
Description:   Vector Single Precision Complex Dot Product
Entry/params:  ZDOTPR (A, I, B, J, C, N)

Formula:       C[0] = sum (A->realp[mI]*B->realp[mJ]
                         - A->imagp[mI]*B->imagp[mJ])
               C[1] = sum (A->realp[mI]*B->imagp[mJ]
                         + A->imagp[mI]*B->realp[mJ])
               for m=0 to N-1
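
As a plain-C reading of the formula in this header, the sketch below computes the strided split-complex dot product directly. The type split_complex and the functions zdotpr_ref and zdotpr_conj_by_swap are hypothetical names introduced for illustration, and treating C as a two-element float array is an assumption taken from the C[0]/C[1] notation. The second function illustrates the trick that the conjugate entry point of the vector listing appears to use: swapping the real and imaginary pointers of B, and swapping the two outputs, turns the normal product into the sum of conj(A)*B.

```c
#include <assert.h>
#include <stddef.h>

/* Split-complex vector: separate real and imaginary arrays,
   mirroring the realp/imagp fields named in the header. */
typedef struct {
    float *realp;
    float *imagp;
} split_complex;

/* C[0] + i*C[1] = sum over m = 0..N-1 of A[m*I] * B[m*J]. */
void zdotpr_ref(const split_complex *A, ptrdiff_t I,
                const split_complex *B, ptrdiff_t J,
                float C[2], size_t N)
{
    assert(A && B && C);
    float re = 0.0f, im = 0.0f;
    for (size_t m = 0; m < N; ++m) {
        ptrdiff_t ia = (ptrdiff_t)m * I;   /* strided index into A */
        ptrdiff_t ib = (ptrdiff_t)m * J;   /* strided index into B */
        float a_r = A->realp[ia], a_i = A->imagp[ia];
        float b_r = B->realp[ib], b_i = B->imagp[ib];
        re += a_r * b_r - a_i * b_i;
        im += a_r * b_i + a_i * b_r;
    }
    C[0] = re;
    C[1] = im;
}

/* Sum of conj(A[m*I]) * B[m*J], obtained from the normal product by
   swapping B's real/imaginary pointers and the two outputs. */
void zdotpr_conj_by_swap(const split_complex *A, ptrdiff_t I,
                         const split_complex *B, ptrdiff_t J,
                         float C[2], size_t N)
{
    assert(A && B && C);
    split_complex Bs = { B->imagp, B->realp };  /* swapped view of B */
    float t[2];
    zdotpr_ref(A, I, &Bs, J, t, N);
    C[0] = t[1];   /* sum(Ar*Br + Ai*Bi) */
    C[1] = t[0];   /* sum(Ar*Bi - Ai*Br) */
}
```

For example, with A = (1+2i, 3-i) and B = (2+i, -1+4i), the normal product is 1+18i and the conjugate product is -3+8i.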

Mercury Computer Systems, Inc. 
Copyright (c) 1998 All rights reserved 

Date Engineer Reason 

981209 fpl Created (from cdotpr.mac) 

990310 fpl 750/G4 integration 

990322 fpl Stylistic changes 



0.1 



#define COMPILE_ESAL_JUMP_TABLE
#define FUNC_TYPE ZDOTPR

#if defined( BUILD_MAX )

#undef VMX_SAL
#undef VMX_NN
#undef VMX_NC
#undef VMX_CN
#undef VMX_CC

#if !defined( COMPILE_ESAL_JUMP_TABLE ) || defined( COMPILE_NO_ESAL_JUMP_TABLE )

/* 1 variant: _zdotpr_vmx() */



/* 5 variants based on ESAL flag */

#define VMX_NN
#include "zdotpr_vmx.k"
#undef VMX_NN

#define VMX_NC
#include "zdotpr_vmx.k"
#undef VMX_NC

#define VMX_CN
#include "zdotpr_vmx.k"
#undef VMX_CN

#define VMX_CC
#include "zdotpr_vmx.k"
#undef VMX_CC

#endif /* end COMPILE_ESAL_JUMP_TABLE */
#endif /* end BUILD_MAX */